Hi
I am running into trouble with a nested query using python. To try and debug
it, I first wrote the query I want using sqlite3
select freq.docid, freqTranspose.docid, sum(freq.count *
freqTranspose.count) from
Frequency as freq,
(select term, docid, count from Frequency) as freqTranspose
where freq.term = freqTranspose.term
group by freq.docid, freqTranspose.docid
;
Sparksql has trouble parsing the "(select ) as freqTranspose ³ line
Here is what my input data looks like
$ head -n 3 reuters.db.csv
docid,term,count
10000_txt_earn,net,1
10000_txt_earn,rogers,4
The output from sqlite3 is
$ head -n 6 3hSimilarityMatrix.slow.sql.out
freq.docid freqTranspose.docid sum(freq.count * freqTranspose.count)
-------------- ------------------- -------------------------------------
10000_txt_earn 10000_txt_earn 127
10000_txt_earn 10054_txt_earn 33
10000_txt_earn 10080_txt_crude 146
10000_txt_earn 10088_txt_acq 11
$
My code example pretty much follows
http://spark.apache.org/docs/latest/sql-programming-guide.html
dataFile = sc.textFile("reuters.db.csv²)
lines = dataFile.map(lambda l: l.split(",²))
def mapLines(line) :
ret = {}
ret['docid'] = line[0]
ret['term'] = line[1]
ret['count'] = line[2]
return ret
frequency = lines.map(mapLines)
schemaFrequency = sqlContext.inferSchema(frequency)
schemaFrequency.registerAsTable("frequency²)
Okay here is where I run into trouble
sqlCmd = "select \
freq.docid, \
freqTranspose.docid \
from \
frequency as freq, \
(select term, docid, count from frequency) \
"
similarities = sqlContext.sql(sqlCmd)
/Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib
/py4j-0.8.1-src.zip/py4j/protocol.py in get_return_value(answer,
gateway_client, target_id, name)
298 raise Py4JJavaError(
299 'An error occurred while calling {0}{1}{2}.\n'.
--> 300 format(target_id, '.', name), value)
301 else:
302 raise Py4JError(
Py4JJavaError: An error occurred while calling o40.sql.
: java.lang.RuntimeException: [1.153] failure: ``('' expected but `from'
found
select freq.docid, freqTranspose.docid
from frequency as freq, (select term, docid,
count from frequency)
Simple sql seems to ³parse² I.e. Select freq.docid from frequency as freq
Any suggestions would be greatly appreciated.
Andy
P.s. I should note, I think I am using version 1.0 ?