[ https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rodrigo Boavida updated SPARK-36986:
------------------------------------
    Description: 
Our Spark usage requires us to build an external schema and pass it in when 
creating a Dataset.

While working through this, I found a couple of optimizations that would 
greatly improve Spark's flexibility in handling external schema management.

Scope: the ability to retrieve a field's index and definition by name in a 
single call, returned as a tuple.

This means extending the StructType class with an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * If it doesn't find it, returns None.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful from an efficiency perspective when parsing a JSON 
structure: for every incoming field we want to look up, in a single pass, the 
name and field type already defined in the schema.
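As a minimal usage sketch (the schema and field names below are illustrative, 
and the method is assumed to have been added to StructType as proposed):

import org.apache.spark.sql.types._

// Externally managed schema with two example fields.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

// One lookup instead of separate contains/fieldIndex calls.
schema.getIndexAndFieldByName("name") match {
  case Some((idx, field)) =>
    println(s"'${field.name}' is at index $idx with type ${field.dataType}")
  case None =>
    println("field not defined in the schema")
}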

I will create a corresponding branch for PR review, assuming that there are no 
concerns with the above proposal.

 

  was:
Our Spark usage requires us to build an external schema and pass it in when 
creating a Dataset.

While working through this, I found a couple of optimizations that would 
greatly improve Spark's flexibility in handling external schema management.

1 - The ability to retrieve a field's index and definition by name in a single 
call, returned as a tuple.

This means extending the StructType class with an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * If it doesn't find it, returns None.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful from an efficiency perspective when parsing a JSON 
structure: for every incoming field we want to look up, in a single pass, the 
name and field type already defined in the schema.

 

2 - Allowing a Dataset to be created from a schema, passing the corresponding 
InternalRows whose internal types map to the externally defined schema. This 
allows Spark rows to be created from any data structure, without depending on 
Spark's internal conversions (in particular for JSON parsing), and improves 
performance by skipping the Catalyst converters' job of converting native Java 
types into Spark types.

This is what the function would look like:

 

/**
 * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows
 * the caller to create the InternalRow set externally, as well as define the schema externally.
 *
 * @since 3.3.0
 */
def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {
  val attributes = schema.toAttributes
  val plan = LogicalRDD(attributes, data)(self)
  val qe = sessionState.executePlan(plan)
  qe.assertAnalyzed()
  new Dataset[Row](this, plan, RowEncoder(schema))
}

 

This is similar to this function:

def createDataFrame(rows: java.util.List[Row], schema: StructType): DataFrame

But it doesn't depend on Spark internally creating the RDD by inference, for 
example from a JSON structure, which is not useful if we're managing the schema 
externally.

It also skips the Catalyst conversions and the corresponding object overhead, 
making internal row generation much more efficient, since it is done explicitly 
by the caller.
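
As a minimal usage sketch of the proposed API (the schema, the data, and the 
spark session variable are illustrative assumptions):

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types._
import org.apache.spark.unsafe.types.UTF8String

// Externally managed schema.
val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)
))

// InternalRows built directly with Spark-internal types
// (UTF8String rather than java.lang.String), so no Catalyst conversion is needed.
val rows = spark.sparkContext.parallelize(Seq(
  InternalRow(1L, UTF8String.fromString("alice")),
  InternalRow(2L, UTF8String.fromString("bob"))
))

// Proposed method: no type inference and no CatalystTypeConverters pass.
val df = spark.createDataset(rows, schema)
df.show()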

 

I will create a corresponding branch for PR review, assuming that there are no 
concerns with the above proposals.

 


> Improving external schema management flexibility
> ------------------------------------------------
>
>                 Key: SPARK-36986
>                 URL: https://issues.apache.org/jira/browse/SPARK-36986
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rodrigo Boavida
>            Priority: Major
>


