[ 
https://issues.apache.org/jira/browse/SPARK-36986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rodrigo Boavida updated SPARK-36986:
------------------------------------
    Description: 
Our Spark usage requires us to build an external schema and pass it in when 
creating a Dataset.

While working through this, I found an optimization that would greatly improve 
Spark's flexibility when querying an externally managed schema.

Scope: the ability to retrieve a field's index and its StructField definition 
in a single call, looked up by name and returned as a tuple.

This means extending the StructType class with an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * If it doesn't find it, returns None.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful for efficiency when we are parsing a JSON 
structure and want to check, for every field, the name and field type already 
defined in the schema.
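
As a rough illustration of the lookup described above, here is a minimal, self-contained sketch. The StructField and StructType definitions in it are simplified stand-ins written for this example (they are not Spark's classes), and the precomputed map is an assumption about how a single-pass lookup could be done, not the proposed patch itself.

```scala
// Simplified stand-ins for illustration only; these are NOT Spark's classes.
final case class StructField(name: String, dataType: String, nullable: Boolean = true)

final case class StructType(fields: Seq[StructField]) {
  // Built once, so each lookup is a single map access that yields
  // both the position of the field and the field itself.
  private lazy val nameToIndexAndField: Map[String, (Int, StructField)] =
    fields.zipWithIndex.map { case (f, i) => (f.name, (i, f)) }.toMap

  /** Returns the index and field for `name`, or None if it is absent. */
  def getIndexAndFieldByName(name: String): Option[(Int, StructField)] =
    nameToIndexAndField.get(name)
}

object Demo {
  def main(args: Array[String]): Unit = {
    val schema = StructType(Seq(StructField("id", "long"), StructField("name", "string")))

    // Hypothetical keys, standing in for field names read from a JSON record.
    val jsonKeys = Seq("name", "unknown")

    jsonKeys.foreach { key =>
      schema.getIndexAndFieldByName(key) match {
        case Some((idx, field)) => println(s"$key: index $idx, type ${field.dataType}")
        case None               => println(s"$key is not in the schema")
      }
    }
  }
}
```

The caller gets the index and the field from one call and one pattern match, instead of checking for the field and then asking for its index separately.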

I will create a corresponding branch for PR review, assuming that there are no 
concerns with the above proposal.

 

  was:
Our Spark usage requires us to build an external schema and pass it in when 
creating a Dataset.

While working through this, I found a couple of optimizations that would 
greatly improve Spark's flexibility in handling an externally managed schema.

Scope: the ability to retrieve a field's index and its StructField definition 
in a single call, looked up by name and returned as a tuple.

This means extending the StructType class with an additional method.

This is what the function would look like:

/**
 * Returns the index and field structure by name.
 * If it doesn't find it, returns None.
 * Avoids two client calls/loops to obtain consolidated field info.
 */
def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
  val field = nameToField.get(name)
  if (field.isDefined) {
    Some((fieldIndex(name), field.get))
  } else {
    None
  }
}

This is particularly useful for efficiency when we are parsing a JSON 
structure and want to check, for every field, the name and field type already 
defined in the schema.

I will create a corresponding branch for PR review, assuming that there are no 
concerns with the above proposal.

 


> Improving schema filtering flexibility
> --------------------------------------
>
>                 Key: SPARK-36986
>                 URL: https://issues.apache.org/jira/browse/SPARK-36986
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rodrigo Boavida
>            Priority: Major
>
> Our Spark usage requires us to build an external schema and pass it in when 
> creating a Dataset.
> While working through this, I found an optimization that would greatly 
> improve Spark's flexibility when querying an externally managed schema.
> Scope: the ability to retrieve a field's index and its StructField 
> definition in a single call, looked up by name and returned as a tuple.
> This means extending the StructType class with an additional method.
> This is what the function would look like:
> /**
>  * Returns the index and field structure by name.
>  * If it doesn't find it, returns None.
>  * Avoids two client calls/loops to obtain consolidated field info.
>  */
> def getIndexAndFieldByName(name: String): Option[(Int, StructField)] = {
>   val field = nameToField.get(name)
>   if (field.isDefined) {
>     Some((fieldIndex(name), field.get))
>   } else {
>     None
>   }
> }
> This is particularly useful for efficiency when we are parsing a JSON 
> structure and want to check, for every field, the name and field type 
> already defined in the schema.
> I will create a corresponding branch for PR review, assuming that there are 
> no concerns with the above proposal.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)