Column bind is called a join in the relational world; Spark uses the same concept.
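
As a rough illustration (the DataFrame names and the shared key column "id" are
made up), a column bind in PySpark is just a join on a key:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two DataFrames that share a key column; joining on it is the relational
# equivalent of R's cbind() when the keys match one-to-one.
left = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "col_x"])
right = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "col_y"])

bound = left.join(right, on="id", how="inner")
bound.show()  # columns: id, col_x, col_y (row order may vary)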

A pivot in the true sense is harder to achieve because you really don't know how
many columns you will end up with, but Spark has a pivot function.
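
For illustration, a hedged sketch of that pivot function (the data and column
names are made up): groupBy().pivot() turns the distinct values of the pivot
column into new columns, which is why the final column set is unknown up front
unless you pass the expected values explicitly.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.createDataFrame(
    [("2021", "Q1", 100), ("2021", "Q2", 150), ("2022", "Q1", 200)],
    ["year", "quarter", "amount"],
)

# Distinct values of "quarter" become columns; passing ["Q1", "Q2"] explicitly
# avoids an extra pass over the data just to discover them.
pivoted = sales.groupBy("year").pivot("quarter", ["Q1", "Q2"]).agg(F.sum("amount"))
pivoted.show()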

On Thu, 17 Mar 2022 at 9:16 am, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> OK, this is the version that works with pandas only, without Spark:
>
> import random
> import string
> import math
> import datetime
> import time
> import pandas as pd
>
> class UsedFunctions:
>
>   def randomString(self,length):
>     letters = string.ascii_letters
>     result_str = ''.join(random.choice(letters) for i in range(length))
>     return result_str
>
>   def clustered(self,x,numRows):
>     return math.floor(x -1)/numRows
>
>   def scattered(self,x,numRows):
>     return abs((x -1 % numRows))* 1.0
>
>   def randomised(self,seed,numRows):
>     random.seed(seed)
>     return abs(random.randint(0, numRows) % numRows) * 1.0
>
>   def padString(self,x,chars,length):
>     n = int(math.log10(x) + 1)
>     result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
>     return result_str
>
>   def padSingleChar(self,chars,length):
>     result_str = ''.join(chars for i in range(length))
>     return result_str
>
>   def println(self,lst):
>     for ll in lst:
>       print(ll[0])
>
>   def createSomeChars(self):
>       # pick from a fixed set of characters; avoids reassigning string.ascii_letters,
>       # which would also change what randomString() picks from on later calls
>       chars = 'ABCDEFGHIJ'
>       return random.choice(chars)
>
> usedFunctions = UsedFunctions()
>
> def main():
>     appName = "RandomDataGenerator"
>     start_time = time.time()
>     randomdata = RandomData()
>     dfRandom = randomdata.generateRandomData()
>
>
> class RandomData:
>     def generateRandomData(self):
>       uf = UsedFunctions()
>       numRows = 10
>       start = 1
>       end = start + numRows - 1
>       print("starting at ID = ", start, ",ending on = ", end)
>       Range = range(start, end + 1)  # include the end value so we get numRows rows
>       df = pd.DataFrame(map(lambda x: (x,
>                                        usedFunctions.clustered(x, numRows),
>                                        usedFunctions.scattered(x, numRows),
>                                        usedFunctions.randomised(x, numRows),
>                                        usedFunctions.randomString(10),
>                                        usedFunctions.padString(x, " ", 20),
>                                        usedFunctions.padSingleChar("z", 20),
>                                        usedFunctions.createSomeChars()),
>                             Range))
>       pd.set_option("display.max_rows", None, "display.max_columns", None)
>       for col_name in df.columns:
>           print(col_name)
>       # group row indices by the value in column 7 (the random character column)
>       print(df.groupby(7).groups)
>       ##print(df)
>       return df
>
> if __name__ == "__main__":
>   main()
>
> and it comes back with this:
>
>
> starting at ID =  1 ,ending on =  10
>
> 0
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> {'B': [5, 7], 'D': [4], 'F': [1], 'G': [0, 3, 6, 8], 'J': [2]}
>
> On Tue, 15 Mar 2022 at 22:19, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Thanks, but I don't want to use Spark; otherwise I could just do this:
>>
>> p_dfm = df.toPandas()  # converting spark DF to Pandas DF
>>
>>
>> Can I do it without using Spark?
>>
>> On Tue, 15 Mar 2022 at 22:08, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>> wrote:
>>
>>> You have a PySpark DataFrame and you want to convert it to pandas?
>>>
>>> Convert it first to the pandas API on Spark:
>>>
>>> pf01 = f01.to_pandas_on_spark()
>>>
>>> Then convert that to plain pandas:
>>>
>>> pdf01 = pf01.to_pandas()
>>>
>>> Or?
>>>
>>> On Tue, 15 Mar 2022 at 22:56, Mich Talebzadeh <mich.talebza...@gmail.com>
>>> wrote:
>>>
>>>> Thanks everyone.
>>>>
>>>> I want to do the following in pandas and numpy without using Spark.
>>>>
>>>> This is what I do in Spark to generate some random data using the class
>>>> UsedFunctions (the class itself is not important).
>>>>
>>>> import random
>>>> import string
>>>> import math
>>>>
>>>> class UsedFunctions:
>>>>   def randomString(self,length):
>>>>     letters = string.ascii_letters
>>>>     result_str = ''.join(random.choice(letters) for i in range(length))
>>>>     return result_str
>>>>   def clustered(self,x,numRows):
>>>>     return math.floor(x -1)/numRows
>>>>   def scattered(self,x,numRows):
>>>>     return abs((x -1 % numRows))* 1.0
>>>>   def randomised(self,seed,numRows):
>>>>     random.seed(seed)
>>>>     return abs(random.randint(0, numRows) % numRows) * 1.0
>>>>   def padString(self,x,chars,length):
>>>>     n = int(math.log10(x) + 1)
>>>>     result_str = ''.join(random.choice(chars) for i in range(length - n)) + str(x)
>>>>     return result_str
>>>>   def padSingleChar(self,chars,length):
>>>>     result_str = ''.join(chars for i in range(length))
>>>>     return result_str
>>>>   def println(self,lst):
>>>>     for ll in lst:
>>>>       print(ll[0])
>>>>
>>>>
>>>> usedFunctions = UsedFunctions()
>>>>
>>>> numRows = 10  # needed by the helper calls below
>>>> start = 1
>>>> end = start + numRows - 1
>>>> print ("starting at ID = ",start, ",ending on = ",end)
>>>> Range = range(start, end + 1)  # include the end value
>>>> rdd = sc.parallelize(Range). \
>>>>          map(lambda x: (x, usedFunctions.clustered(x,numRows), \
>>>>                            usedFunctions.scattered(x,numRows), \
>>>>                            usedFunctions.randomised(x,numRows), \
>>>>                            usedFunctions.randomString(50), \
>>>>                            usedFunctions.padString(x," ",50), \
>>>>                            usedFunctions.padSingleChar("x",4000)))
>>>> df = rdd.toDF()
>>>>
>>>> OK, how can I create a pandas DataFrame df without using Spark?
>>>>
>>>> Thanks
>>>>
>>>> On Tue, 15 Mar 2022 at 21:19, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Andrew. Mich asked, and I answered with transpose():
>>>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transpose.html
>>>>>
>>>>> And now you are asking, in the same thread, about the pandas API on Spark
>>>>> and transform().
>>>>>
>>>>> Apache Spark has a pandas API on Spark.
>>>>>
>>>>> That means Spark provides an API for the pandas functions, and when you use
>>>>> the pandas API on Spark, it is still Spark you are using.
>>>>>
>>>>> Add this line to your imports:
>>>>>
>>>>> from pyspark import pandas as ps
>>>>>
>>>>>
>>>>> Now you can pass your DataFrame back and forth between Spark and the pandas
>>>>> API on Spark by using:
>>>>>
>>>>> pf01 = f01.to_pandas_on_spark()
>>>>>
>>>>>
>>>>> f01 = pf01.to_spark()
>>>>>
>>>>>
>>>>> Note that I have changed pd to ps here.
>>>>>
>>>>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)})
>>>>>
>>>>> df.transform(lambda x: x + 1)
>>>>>
>>>>> You will now see that every number has been incremented by 1.
>>>>>
>>>>> You can find more information about the pandas API on Spark transform here:
>>>>> https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.transform.html?highlight=pyspark%20pandas%20dataframe%20transform#pyspark.pandas.DataFrame.transform
>>>>> or in your notebook with:
>>>>> df.transform?
>>>>>
>>>>> Signature:
>>>>> df.transform(
>>>>>     func: Callable[..., ForwardRef('Series')],
>>>>>     axis: Union[int, str] = 0,
>>>>>     *args: Any,
>>>>>     **kwargs: Any,
>>>>> ) -> 'DataFrame'
>>>>> Docstring:
>>>>> Call ``func`` on self producing a Series with transformed values
>>>>> and that has the same length as its input.
>>>>>
>>>>> See also `Transform and apply a function
>>>>> <https://koalas.readthedocs.io/en/latest/user_guide/transform_apply.html>`_.
>>>>>
>>>>> .. note:: this API executes the function once to infer the type which is
>>>>>      potentially expensive, for instance, when the dataset is created after
>>>>>      aggregations or sorting.
>>>>>
>>>>>      To avoid this, specify return type in ``func``, for instance, as below:
>>>>>
>>>>>      >>> def square(x) -> ps.Series[np.int32]:
>>>>>      ...     return x ** 2
>>>>>
>>>>>      pandas-on-Spark uses return type hint and does not try to infer the type.
>>>>>
>>>>> .. note:: the series within ``func`` is actually multiple pandas series as the
>>>>>     segments of the whole pandas-on-Spark series; therefore, the length of each
>>>>>     series is not guaranteed. As an example, an aggregation against each series
>>>>>     does work as a global aggregation but an aggregation of each segment. See
>>>>>     below:
>>>>>
>>>>>     >>> def func(x) -> ps.Series[np.int32]:
>>>>>     ...     return x + sum(x)
>>>>>
>>>>> Parameters
>>>>> ----------
>>>>> func : function
>>>>>     Function to use for transforming the data. It must work when pandas Series
>>>>>     is passed.
>>>>> axis : int, default 0 or 'index'
>>>>>     Can only be set to 0 at the moment.
>>>>> *args
>>>>>     Positional arguments to pass to func.
>>>>> **kwargs
>>>>>     Keyword arguments to pass to func.
>>>>>
>>>>> Returns
>>>>> -------
>>>>> DataFrame
>>>>>     A DataFrame that must have the same length as self.
>>>>>
>>>>> Raises
>>>>> ------
>>>>> Exception : If the returned DataFrame has a different length than self.
>>>>>
>>>>> See Also
>>>>> --------
>>>>> DataFrame.aggregate : Only perform aggregating type operations.
>>>>> DataFrame.apply : Invoke function on DataFrame.
>>>>> Series.transform : The equivalent function for Series.
>>>>>
>>>>> Examples
>>>>> --------
>>>>> >>> df = ps.DataFrame({'A': range(3), 'B': range(1, 4)}, columns=['A', 'B'])
>>>>> >>> df
>>>>>    A  B
>>>>> 0  0  1
>>>>> 1  1  2
>>>>> 2  2  3
>>>>>
>>>>> >>> def square(x) -> ps.Series[np.int32]:
>>>>> ...     return x ** 2
>>>>> >>> df.transform(square)
>>>>>    A  B
>>>>> 0  0  1
>>>>> 1  1  4
>>>>> 2  4  9
>>>>>
>>>>> You can omit the type hint and let pandas-on-Spark infer its type.
>>>>>
>>>>> >>> df.transform(lambda x: x ** 2)
>>>>>    A  B
>>>>> 0  0  1
>>>>> 1  1  4
>>>>> 2  4  9
>>>>>
>>>>> For multi-index columns:
>>>>>
>>>>> >>> df.columns = [('X', 'A'), ('X', 'B')]
>>>>> >>> df.transform(square)  # doctest: +NORMALIZE_WHITESPACE
>>>>>    X
>>>>>    A  B
>>>>> 0  0  1
>>>>> 1  1  4
>>>>> 2  4  9
>>>>>
>>>>> >>> (df * -1).transform(abs)  # doctest: +NORMALIZE_WHITESPACE
>>>>>    X
>>>>>    A  B
>>>>> 0  0  1
>>>>> 1  1  2
>>>>> 2  2  3
>>>>>
>>>>> You can also specify extra arguments.
>>>>>
>>>>> >>> def calculation(x, y, z) -> ps.Series[int]:
>>>>> ...     return x ** y + z
>>>>> >>> df.transform(calculation, y=10, z=20)  # doctest: +NORMALIZE_WHITESPACE
>>>>>       X
>>>>>       A      B
>>>>> 0    20     21
>>>>> 1    21   1044
>>>>> 2  1044  59069
>>>>>
>>>>> File:      /opt/spark/python/pyspark/pandas/frame.py
>>>>> Type:      method
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, 15 Mar 2022 at 19:33, Andrew Davidson <aedav...@ucsc.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi Bjørn,
>>>>>>
>>>>>> I have been looking for spark transform for a while. Can you send me
>>>>>> a link to the pyspark function?
>>>>>>
>>>>>> I assume pandas transform is not really an option. I think it will
>>>>>> try to pull the entire dataframe into the driver's memory.
>>>>>>
>>>>>> Kind regards
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> P.S. My real problem is that Spark does not allow you to bind columns.
>>>>>> You can use union() to bind rows. I could get the equivalent of cbind()
>>>>>> using union().transform().
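>>>>>>
>>>>>> One workaround I have seen sketched (not a built-in cbind; the DataFrame
>>>>>> names below are illustrative) is to attach a positional row number to
>>>>>> each DataFrame and join on it:
>>>>>>
>>>>>> from pyspark.sql import SparkSession, functions as F
>>>>>> from pyspark.sql.window import Window
>>>>>>
>>>>>> spark = SparkSession.builder.getOrCreate()
>>>>>> df_a = spark.createDataFrame([("a",), ("b",)], ["x"])
>>>>>> df_b = spark.createDataFrame([(1,), (2,)], ["y"])
>>>>>>
>>>>>> # row_number() over a constant ordering gives each row a positional index;
>>>>>> # note this pulls everything into a single window partition, so it is only
>>>>>> # reasonable for small data.
>>>>>> w = Window.orderBy(F.lit(1))
>>>>>> a_idx = df_a.withColumn("_rn", F.row_number().over(w))
>>>>>> b_idx = df_b.withColumn("_rn", F.row_number().over(w))
>>>>>>
>>>>>> cbound = a_idx.join(b_idx, "_rn").drop("_rn")
>>>>>> cbound.show()  # columns x and y bound positionally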
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From: *Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>>>> *Date: *Tuesday, March 15, 2022 at 10:37 AM
>>>>>> *To: *Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> *Cc: *"user @spark" <user@spark.apache.org>
>>>>>> *Subject: *Re: pivoting panda dataframe
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.transpose.html
>>>>>> We have that transpose in the pandas API on Spark too.
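>>>>>>
>>>>>> A tiny sketch of that transpose with the pandas API on Spark (the frame
>>>>>> below is made up, and small enough to transpose safely):
>>>>>>
>>>>>> import pyspark.pandas as ps
>>>>>>
>>>>>> psdf = ps.DataFrame({"a": [1, 2], "b": [3, 4]})
>>>>>> # transpose() swaps rows and columns, so the row labels become column headings
>>>>>> print(psdf.transpose())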
>>>>>>
>>>>>>
>>>>>>
>>>>>> You also have stack() and multi-level reshaping:
>>>>>> https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html
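>>>>>>
>>>>>> And a minimal plain-pandas sketch of stack() with multi-level columns
>>>>>> (the data is made up):
>>>>>>
>>>>>> import pandas as pd
>>>>>>
>>>>>> cols = pd.MultiIndex.from_tuples([("height", "cm"), ("weight", "kg")])
>>>>>> df = pd.DataFrame([[170, 65], [180, 80]], index=["cat", "dog"], columns=cols)
>>>>>> # stack() pivots the innermost column level down into the row index
>>>>>> print(df.stack())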
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 15 Mar 2022 at 17:50, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Is it possible to pivot a pandas dataframe by making the rows the column
>>>>>> headings?
>>>>>>
>>>>>>
>>>>>>
>>>>>> thanks
>>>>>>
Best Regards,
Ayan Guha
