Re: SparkSql newbie problems with nested selects

Michael Armbrust Sun, 13 Jul 2014 13:19:22 -0700

Hi Andy,

The SQL parser is pretty basic (we plan to improve this for the 1.2
release).  In this case I think part of the problem is that one of your
variables is "count", which is a reserved word.  Unfortunately, we don't
have the ability to escape identifiers at this point.


However, I did manage to get your query to parse using the HiveQL parser,
provided by HiveContext.

hiveCtx.hql("""
select freq.docid, freqTranspose.docid, sum(freq.count *
freqTranspose.count) from
       Frequency freq JOIN
       (select term, docid, count from Frequency) freqTranspose
       where freq.term = freqTranspose.term
       group by freq.docid, freqTranspose.docid""")

Michael


On Sun, Jul 13, 2014 at 12:43 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi
>
> I am running into trouble with a nested query using python. To try and
> debug it, I first wrote the query I want using sqlite3
>
> select freq.docid, freqTranspose.docid, sum(freq.count *
> freqTranspose.count) from
>        Frequency as freq,
>        (select term, docid, count from Frequency) as freqTranspose
>        where freq.term = freqTranspose.term
>        group by freq.docid, freqTranspose.docid
>        ;
>
>
> Sparksql has trouble parsing the "(select ) as freqTranspose “ line
>
> Here is what my input data looks like
> $ head -n 3 reuters.db.csv
> docid,term,count
> 10000_txt_earn,net,1
> 10000_txt_earn,rogers,4
>
> The output from sqlite3 is
> $ head -n 6 3hSimilarityMatrix.slow.sql.out
> freq.docid      freqTranspose.docid  sum(freq.count * freqTranspose.count)
> --------------  -------------------  -------------------------------------
> 10000_txt_earn  10000_txt_earn       127
> 10000_txt_earn  10054_txt_earn       33
> 10000_txt_earn  10080_txt_crude      146
> 10000_txt_earn  10088_txt_acq        11
> $
>
>
> My code example pretty much follows
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> dataFile = sc.textFile("reuters.db.csv”)
> lines = dataFile.map(lambda l: l.split(",”))
> def mapLines(line) :
>     ret = {}
>     ret['docid'] = line[0]
>     ret['term'] = line[1]
>     ret['count'] = line[2]
>     return ret
> frequency = lines.map(mapLines)
> schemaFrequency = sqlContext.inferSchema(frequency)
> schemaFrequency.registerAsTable("frequency”)
>
> Okay here is where I run into trouble
>
> sqlCmd = "select \
>             freq.docid, \
>             freqTranspose.docid \
>           from \
>               frequency as freq, \
>               (select term, docid, count from frequency)  \
>         "
> similarities = sqlContext.sql(sqlCmd)
>
>
> /Users/andy/workSpace/dataBricksIntroToApacheSpark/USBStick/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)    298          
>        raise Py4JJavaError(    299                     'An error occurred 
> while calling {0}{1}{2}.\n'.--> 300                     format(target_id, 
> '.', name), value)    301             else:    302                 raise 
> Py4JError(
> Py4JJavaError: An error occurred while calling o40.sql.
> : java.lang.RuntimeException: [1.153] failure: ``('' expected but `from' found
>
> select             freq.docid,             freqTranspose.docid           from 
>               frequency as freq,               (select term, docid, count 
> from frequency)
>
>
>
> Simple sql seems to “parse” I.e. Select  freq.docid from frequency as freq
>
>
> Any suggestions would be greatly appreciated.
>
>
> Andy
>
>
> P.s. I should note, I think I am using version 1.0 ?
>
>
>
>

Re: SparkSql newbie problems with nested selects

Reply via email to