Cool. Here is it how it goes... I am reading Avro objects from a Kafka topic as a DStream, converting it into a DataFrame so that I can filter out records based on some conditions and finally do some aggregations on these filtered records. During the process I also need to tag each record based on the value of a particular column, and for this I am iterating over Array[Row] returned by DataFrame.collect().
I am good as far as these things are concerned. The only thing which I am not getting is the reason behind changed column ordering within each Row. Say my actual record is [Tariq, IN, APAC]. When I do println(row.mkString("~")) it shows [IN~APAC~Tariq]. I hope I was able to explain my use case to you! [image: http://] Tariq, Mohammad about.me/mti [image: http://] <http://about.me/mti> On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasain...@gmail.com> wrote: > Hi Tariq, > > Can you tell in brief what kind of operation you have to do? I can try > helping you out with that. > In general, if you are trying to use any group operations you can use > window operations. > > On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <donta...@gmail.com> wrote: > >> Hi Sainath, >> >> Thank you for the prompt response! >> >> Could you please elaborate your answer a bit? I'm sorry I didn't quite >> get this. What kind of operation I can perform using SQLContext? It just >> helps us during things like DF creation, schema application etc, IMHO. >> >> >> >> [image: http://] >> >> Tariq, Mohammad >> about.me/mti >> [image: http://] >> <http://about.me/mti> >> >> >> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasain...@gmail.com> >> wrote: >> >>> Instead of collecting the data frame, you can try using a sqlContext on >>> the data frame. But it depends on what kind of operations are you trying to >>> perform. >>> >>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <donta...@gmail.com> >>> wrote: >>> >>>> Hi list, >>>> >>>> *Scenario :* >>>> I am creating a DStream by reading an Avro object from a Kafka topic >>>> and then converting it into a DataFrame to perform some operations on the >>>> data. I call DataFrame.collect() and perform the intended operation on each >>>> Row of Array[Row] returned by DataFrame.collect(). >>>> >>>> *Problem : * >>>> Calling DataFrame.collect() changes the schema of the underlying >>>> record, thus making it impossible to get the columns by index(as the order >>>> gets changed). >>>> >>>> *Query :* >>>> Is it the way DataFrame.collect() behaves or am I doing something wrong >>>> here? In former case is there any way I can maintain the schema while >>>> getting each Row? >>>> >>>> Any pointers/suggestions would be really helpful. Many thanks! >>>> >>>> >>>> [image: http://] >>>> >>>> Tariq, Mohammad >>>> about.me/mti >>>> [image: http://] >>>> <http://about.me/mti> >>>> >>>> >>> >>> >> >