Hi, I have a five-node CDH 5.3 cluster running on CentOS 6.5, plus a separate install of Spark 1.3.1. (The CDH 5.3 install ships with Spark 1.2, but I wanted a newer version.)
I managed to write some Scala-based code using a HiveContext to connect to Hive and create/populate tables etc. I compiled my application with sbt and ran it with spark-submit in local mode. My question concerns UDFs, specifically the row_sequence function in the hive-contrib jar file, i.e.

    hiveContext.sql("""
      ADD JAR /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/hive-contrib-0.13.1-cdh5.3.3.jar
    """)

    hiveContext.sql("""
      CREATE TEMPORARY FUNCTION row_sequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'
    """)

    val resRDD = hiveContext.sql("""
      SELECT row_sequence(), t1.edu
      FROM ( SELECT DISTINCT education AS edu FROM adult3 ) t1
      ORDER BY t1.edu
    """)

This seems to generate its sequence in the map (?) phase of execution, because no matter how I fiddle with the main SQL I cannot get an ascending index for the dimension data. That is, I always get

    1 val1
    1 val2
    1 val3

instead of

    1 val1
    2 val2
    3 val3

I'm well aware that I can work around this in Scala, and I have, but I wondered whether others have come across this and solved it?

Cheers,
Mike F
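For what it's worth, here is a minimal plain-Scala sketch (no Spark required, all names here are illustrative) of what I believe is going on: UDFRowSequence keeps a counter per task, so when each task only sees one row, every row gets index 1. It also shows the kind of driver-side workaround I mean, using zipWithIndex to assign one global ascending index; on a real RDD the analogous call would be resRDD.zipWithIndex, assuming the ordering is already fixed by the ORDER BY.

```scala
object RowSequenceSketch extends App {
  val values = Seq("val1", "val2", "val3")

  // Simulate a per-task counter: with one task per value, each counter
  // starts fresh at 1, so every row is tagged (1, value).
  val perTask = values.map { v =>
    val counter = 1 // each simulated task has its own counter
    (counter, v)
  }
  // perTask == Seq((1,"val1"), (1,"val2"), (1,"val3"))

  // Workaround sketch: index the whole (already sorted) sequence once,
  // giving a single ascending index across all values.
  val indexed = values.zipWithIndex.map { case (v, i) => (i + 1, v) }
  // indexed == Seq((1,"val1"), (2,"val2"), (3,"val3"))

  println(perTask)
  println(indexed)
}
```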