Hi 

I have a five-node CDH 5.3 cluster running on CentOS 6.5, plus a separate
install of Spark 1.3.1. (The CDH 5.3 install ships Spark 1.2, but I wanted a
newer version.)

I managed to write some Scala code that uses a HiveContext to connect to Hive
and create/populate tables etc. I compiled my application with sbt and ran it
with spark-submit in local mode.

My question concerns UDFs, specifically the row_sequence function in the
hive-contrib jar file, i.e.

hiveContext.sql("""
  ADD JAR /opt/cloudera/parcels/CDH-5.3.3-1.cdh5.3.3.p0.5/jars/hive-contrib-0.13.1-cdh5.3.3.jar
""")

hiveContext.sql("""
  CREATE TEMPORARY FUNCTION row_sequence AS 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence'
""")


val resRDD = hiveContext.sql("""
  SELECT row_sequence(), t1.edu
  FROM ( SELECT DISTINCT education AS edu FROM adult3 ) t1
  ORDER BY t1.edu
""")

This seems to generate its sequence in the map (?) phase of execution: no
matter how I fiddle with the outer SQL, I cannot get an ascending index for
the dimension data, i.e. I always get

1  val1
1  val2
1  val3

instead of 

1  val1
2  val2
3  val3

I'm well aware that I can work around this in Scala, and I have, but I
wondered whether others have come across this and solved it?
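For what it's worth, the workaround I used is roughly the following sketch. It
assumes the same adult3 table and education column as above, and relies on
RDD.zipWithIndex, which assigns a single, globally ascending index (unlike
row_sequence(), which appears to keep a separate counter per task):

```scala
// Sketch of the Scala-side workaround (assumes the adult3 table above).
val distinctEdu = hiveContext.sql(
  "SELECT DISTINCT education AS edu FROM adult3 ORDER BY edu")

// zipWithIndex gives one global 0-based index across all partitions,
// preserving the sort order of the query above; shift it to start at 1.
val indexed = distinctEdu.rdd
  .zipWithIndex()
  .map { case (row, idx) => (idx + 1, row.getString(0)) }

indexed.collect().foreach { case (i, edu) => println(s"$i  $edu") }
```

This produces the 1, 2, 3, ... numbering I was after, but it pulls the
sequencing out of SQL and into Scala, which is what I was hoping to avoid.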

cheers

Mike F                                    
