Re: How to change a DataFrame column from nullable to not nullable in PySpark

ashok34...@yahoo.com.INVALID Fri, 15 Oct 2021 04:29:36 -0700

 Many thanks all, especially to Mich. That is what I was looking for.
    On Friday, 15 October 2021, 09:28:24 BST, Mich Talebzadeh 
<mich.talebza...@gmail.com> wrote:  
 
 Spark lets you define a DataFrame's schema either as a StructType or as a list
of column names. By default, Spark treats every field as nullable when it
creates a DataFrame. To change nullability, you need to supply the schema
explicitly.
Assume that I have created an RDD in the form:

rdd = sc.parallelize(Range). \
         map(lambda x: (x,
                        usedFunctions.clustered(x, numRows),
                        usedFunctions.scattered(x, numRows),
                        usedFunctions.randomised(x, numRows),
                        usedFunctions.randomString(50),
                        usedFunctions.padString(x, " ", 50),
                        usedFunctions.padSingleChar("x", 4000)))
For the above, I create a schema with StructType as below:

from pyspark.sql.types import (StructType, StructField, IntegerType,
                               FloatType, StringType)

# The third argument of StructField is the nullable flag
Schema = StructType([StructField("ID", IntegerType(), False),
                     StructField("CLUSTERED", FloatType(), True),
                     StructField("SCATTERED", FloatType(), True),
                     StructField("RANDOMISED", FloatType(), True),
                     StructField("RANDOM_STRING", StringType(), True),
                     StructField("SMALL_VC", StringType(), True),
                     StructField("PADDING", StringType(), True)])
Note that the first column, ID, is defined as NOT NULL (nullable = False).
Then I can create a DataFrame df as below:

df = spark.createDataFrame(rdd, schema=Schema)
df.printSchema()
root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)



HTH





 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss,
damage or destruction of data or any other property which may arise from relying
on this email's technical content is explicitly disclaimed. The author will in
no case be liable for any monetary damages arising from such loss, damage or
destruction.

 


On Thu, 14 Oct 2021 at 12:50, ashok34...@yahoo.com.INVALID 
<ashok34...@yahoo.com.invalid> wrote:

Gurus,
I have an RDD in PySpark that I can convert to a DataFrame with
df = rdd.toDF()

However, when I do
df.printSchema()

I see that all the columns are nullable = true by default:
root
 |-- COL-1: long (nullable = true)
 |-- COl-2: double (nullable = true)
 |-- COl-3: string (nullable = true)

What would be the easiest way to make COL-1 NOT NULLABLE?
Thanking you
