Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-15 Thread ashok34...@yahoo.com.INVALID
 Many thanks all, especially to Mich. That is what I was looking for.
On Friday, 15 October 2021, 09:28:24 BST, Mich Talebzadeh 
 wrote:  
 
Spark allows one to define the column schema as a StructType. By default,
Spark assumes that all fields are nullable when creating a DataFrame.
To change nullability you need to provide the structure of the columns.
Assume that I have created an RDD in the form
rdd = sc.parallelize(Range). \
      map(lambda x: (x, usedFunctions.clustered(x,numRows), \
                        usedFunctions.scattered(x,numRows), \
                        usedFunctions.randomised(x,numRows), \
                        usedFunctions.randomString(50), \
                        usedFunctions.padString(x," ",50), \
                        usedFunctions.padSingleChar("x",4000)))
For the above I create a schema with StructType as below:
Schema = StructType([ StructField("ID", IntegerType(), False),
                      StructField("CLUSTERED", FloatType(), True),
                      StructField("SCATTERED", FloatType(), True),
                      StructField("RANDOMISED", FloatType(), True),
                      StructField("RANDOM_STRING", StringType(), True),
                      StructField("SMALL_VC", StringType(), True),
                      StructField("PADDING", StringType(), True)
                    ])
Note that the first column ID is defined as NOT NULL (nullable = False).
Then I can create a dataframe df as below
df = spark.createDataFrame(rdd, schema=Schema)
df.printSchema()
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)


HTH




   view my Linkedin profile

 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from relying
on this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage or
destruction.

 


On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID 
 wrote:

Gurus,
I have an RDD in PySpark that I can convert to DF through
df = rdd.toDF()

However, when I do
df.printSchema()

I see the columns as nullable = true by default:

root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE?
Thanking you
  

Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-15 Thread Mich Talebzadeh
Spark allows one to define the column schema as a StructType. By default,
Spark assumes that all fields are nullable when creating a DataFrame.

To change nullability you need to provide the structure of the columns.

Assume that I have created an RDD in the form

rdd = sc.parallelize(Range). \
 map(lambda x: (x, usedFunctions.clustered(x,numRows), \
   usedFunctions.scattered(x,numRows), \
   usedFunctions.randomised(x,numRows), \
   usedFunctions.randomString(50), \
   usedFunctions.padString(x," ",50), \
   usedFunctions.padSingleChar("x",4000)))

For the above I create a schema with StructType as below:

Schema = StructType([ StructField("ID", IntegerType(), False),
  StructField("CLUSTERED", FloatType(), True),
  StructField("SCATTERED", FloatType(), True),
  StructField("RANDOMISED", FloatType(), True),
  StructField("RANDOM_STRING", StringType(), True),
  StructField("SMALL_VC", StringType(), True),
  StructField("PADDING", StringType(), True)
])

Note that the first column ID is defined as NOT NULL (nullable = False).

Then I can create a dataframe df as below

df = spark.createDataFrame(rdd, schema=Schema)
df.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)


HTH


   view my Linkedin profile




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID
 wrote:

> Gurus,
>
> I have an RDD in PySpark that I can convert to DF through
>
> df = rdd.toDF()
>
> However, when I do
>
> df.printSchema()
>
> I see the columns as nullable = true by default:
>
> root
>  |-- COL-1: long (nullable = true)
>  |-- COl-2: double (nullable = true)
>  |-- COl-3: string (nullable = true)
>
> What would be the easiest way to make COL-1 NOT NULLABLE?
>
> Thanking you
>


Re: How to change a DataFrame column from nullable to not nullable in PySpark

2021-10-14 Thread Sonal Goyal
I see some nice answers at
https://stackoverflow.com/questions/46072411/can-i-change-the-nullability-of-a-column-in-my-spark-dataframe

On Thu, 14 Oct 2021 at 5:21 PM, ashok34...@yahoo.com.INVALID
 wrote:

> Gurus,
>
> I have an RDD in PySpark that I can convert to DF through
>
> df = rdd.toDF()
>
> However, when I do
>
> df.printSchema()
>
> I see the columns as nullable = true by default:
>
> root
>  |-- COL-1: long (nullable = true)
>  |-- COl-2: double (nullable = true)
>  |-- COl-3: string (nullable = true)
>
> What would be the easiest way to make COL-1 NOT NULLABLE?
>
> Thanking you
>
-- 
Cheers,
Sonal
https://github.com/zinggAI/zingg