I think this could be the reason : DataFrame sorts the column of each record lexicographically if we do a *select **. So, if we wish to maintain a specific column ordering while processing we should use do *select col1, col2...* instead of select *.
However, this is just what I feel. Let's wait for comments from the gurus. [image: http://] Tariq, Mohammad about.me/mti [image: http://] <http://about.me/mti> On Thu, Mar 3, 2016 at 5:35 AM, Mohammad Tariq <donta...@gmail.com> wrote: > Cool. Here is it how it goes... > > I am reading Avro objects from a Kafka topic as a DStream, converting it > into a DataFrame so that I can filter out records based on some conditions > and finally do some aggregations on these filtered records. During the > process I also need to tag each record based on the value of a particular > column, and for this I am iterating over Array[Row] returned by > DataFrame.collect(). > > I am good as far as these things are concerned. The only thing which I am > not getting is the reason behind changed column ordering within each Row. > Say my actual record is [Tariq, IN, APAC]. When I > do println(row.mkString("~")) it shows [IN~APAC~Tariq]. > > I hope I was able to explain my use case to you! > > > > [image: http://] > > Tariq, Mohammad > about.me/mti > [image: http://] > <http://about.me/mti> > > > On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasain...@gmail.com> > wrote: > >> Hi Tariq, >> >> Can you tell in brief what kind of operation you have to do? I can try >> helping you out with that. >> In general, if you are trying to use any group operations you can use >> window operations. >> >> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <donta...@gmail.com> >> wrote: >> >>> Hi Sainath, >>> >>> Thank you for the prompt response! >>> >>> Could you please elaborate your answer a bit? I'm sorry I didn't quite >>> get this. What kind of operation I can perform using SQLContext? It just >>> helps us during things like DF creation, schema application etc, IMHO. >>> >>> >>> >>> [image: http://] >>> >>> Tariq, Mohammad >>> about.me/mti >>> [image: http://] >>> <http://about.me/mti> >>> >>> >>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasain...@gmail.com> >>> wrote: >>> >>>> Instead of collecting the data frame, you can try using a sqlContext on >>>> the data frame. But it depends on what kind of operations are you trying to >>>> perform. >>>> >>>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <donta...@gmail.com> >>>> wrote: >>>> >>>>> Hi list, >>>>> >>>>> *Scenario :* >>>>> I am creating a DStream by reading an Avro object from a Kafka topic >>>>> and then converting it into a DataFrame to perform some operations on the >>>>> data. I call DataFrame.collect() and perform the intended operation on >>>>> each >>>>> Row of Array[Row] returned by DataFrame.collect(). >>>>> >>>>> *Problem : * >>>>> Calling DataFrame.collect() changes the schema of the underlying >>>>> record, thus making it impossible to get the columns by index(as the order >>>>> gets changed). >>>>> >>>>> *Query :* >>>>> Is it the way DataFrame.collect() behaves or am I doing something >>>>> wrong here? In former case is there any way I can maintain the schema >>>>> while >>>>> getting each Row? >>>>> >>>>> Any pointers/suggestions would be really helpful. Many thanks! >>>>> >>>>> >>>>> [image: http://] >>>>> >>>>> Tariq, Mohammad >>>>> about.me/mti >>>>> [image: http://] >>>>> <http://about.me/mti> >>>>> >>>>> >>>> >>>> >>> >> >