Yup, sorry, I should have explained myself better. I know which columns are numeric and which aren't because I have a StructType, and its internal StructFields tell me which ones have a numeric DataType and which don't.
So assuming I have a JSON string that has double quotes on numbers when it shouldn't, and I have the correct schema in a StructType, how can I iterate over them to programmatically create the new dataframe in the correct format? Do I iterate over the columns in the StructType, or iterate over the columns in the dataframe and try to match them against the StructType?

I hope that clears things up. What I wouldn't give for a drawing board right now!

On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <guha.a...@gmail.com> wrote:

> Umm... I think the premise is you need to "know" beforehand which columns
> are numeric. Unless you know it, how would you apply the schema?
>
> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Thanks ayan, but I meant how to derive the list automatically.
>>
>> In your example you are specifying the numeric columns, and I would like
>> it to be applied to any schema, if that makes sense.
>>
>> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Simple (pyspark) example:
>>>
>>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>> df.printSchema()
>>> root
>>>  |-- customerid: string (nullable = true)
>>>  |-- foo: string (nullable = true)
>>>
>>> numeric_field_list = ['customerid']
>>>
>>> for k in numeric_field_list:
>>> ...     df = df.withColumn(k, df[k].cast("long"))
>>> ...
>>> df.printSchema()
>>> root
>>>  |-- customerid: long (nullable = true)
>>>  |-- foo: string (nullable = true)
>>>
>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Ok, thanks Michael!
>>>
>>> Can I get an idea on where to start? Assuming I have the end schema and
>>> the current dataframe...
>>> How can I loop through it and create a new dataframe using withColumn?
>>>
>>> Am I iterating through the dataframe or the schema?
>>>
>>> I'm assuming it's easier to iterate through the columns in the old df.
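To answer the iterate-over-which question directly: iterate over the schema, since the StructType alone knows which columns should be numeric. A minimal stdlib-only sketch of the idea, with a plain dict standing in for the StructType (the field names and type names here are made up for illustration; with a real pyspark StructType you would walk `schema.fields` and test `isinstance(field.dataType, NumericType)` instead):

```python
import json

# Hypothetical stand-in for a StructType: field name -> type name.
schema = {"customerid": "long", "foo": "string"}
NUMERIC_TYPES = {"long", "int", "double", "float"}

def numeric_fields(schema):
    # Iterate over the schema, not the data: the schema is the source
    # of truth for which columns should be numeric.
    return [name for name, typ in schema.items() if typ in NUMERIC_TYPES]

def cast_record(record, schema):
    # Cast quoted numbers back to real numbers, per the schema.
    fixed = dict(record)
    for name in numeric_fields(schema):
        if isinstance(fixed.get(name), str):
            caster = int if schema[name] in ("long", "int") else float
            fixed[name] = caster(fixed[name])
    return fixed

rows = json.loads('[{"customerid": "535137", "foo": "bar"}]')
fixed_rows = [cast_record(r, schema) for r in rows]
```

In Spark itself the loop body would instead be a `withColumn` cast per numeric field, as in ayan's pyspark example above; only the derivation of the field list changes.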
>>> For each column, cast it correctly and generate a new df?
>>>
>>> Would you recommend that?
>>>
>>> Regards
>>> Sam
>>>
>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>> If you already have the expected schema, and you know that all numbers
>>> will always be formatted as strings in the input JSON, you could probably
>>> derive this list automatically.
>>>
>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>> quotes?
>>>
>>> I think this is likely to be a slower and less robust solution. You
>>> would have to make sure that you got all the corner cases right (i.e.
>>> escaping and whatnot).
>>>
>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> I see, so for the connector I need to pass in an array/list of numerical
>>> columns?
>>>
>>> Wouldn't it be simpler to just regex replace the numbers to remove the
>>> quotes?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>> Specifying the schema when parsing JSON will only let you pick between
>>> similar datatypes (i.e. should this be a short, long, float, double, etc.).
>>> It will not let you perform conversions like string <-> number. This has
>>> to be done with explicit casts after the data has been loaded.
>>>
>>> I think you can make a solution that uses select or withColumn generic.
>>> Just load the dataframe with a "parse schema" that treats numbers as
>>> strings. Then construct a list of columns that should be numbers and apply
>>> the necessary conversions.
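The "parse schema" described here can itself be derived from the expected schema, so nothing has to be listed by hand. A stdlib-only sketch, again using a dict as a hypothetical stand-in for a StructType (with a real one you would rebuild each StructField, swapping any NumericType for StringType):

```python
# Expected schema (hypothetical stand-in for a StructType).
expected = {"customerid": "long", "foo": "string"}
NUMERIC_TYPES = {"long", "int", "double", "float"}

# The "parse schema": identical, except every numeric field is read as a
# string, so quoted numbers load without corruption.
parse_schema = {name: ("string" if typ in NUMERIC_TYPES else typ)
                for name, typ in expected.items()}

# The list of columns to cast afterwards falls out of the same walk.
numeric_columns = [name for name, typ in expected.items()
                   if typ in NUMERIC_TYPES]
```

One pass over the expected schema thus yields both inputs the generic solution needs: the schema to read with and the columns to cast afterwards.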
>>>
>>> import org.apache.spark.sql.functions.col
>>> var df = spark.read.schema(parseSchema).json("...")
>>> numericColumns.foreach { columnName =>
>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>> }
>>>
>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Thanks Michael
>>>
>>> I've been spending the past few days researching this.
>>>
>>> The problem is that the generated JSON has double quotes on fields that
>>> are numbers, because the producing datastore doesn't want to lose precision.
>>>
>>> I can change the data type, true, but that would be specific to one job
>>> rather than a generic streaming job. I'm writing a structured streaming
>>> connector and I have the schema the generated dataframe should match.
>>>
>>> Unfortunately using withColumn won't help me here, since the solution
>>> needs to be generic.
>>>
>>> To summarise, assume I have the following JSON:
>>>
>>> [{
>>>   "customerid": "535137",
>>>   "foo": "bar"
>>> }]
>>>
>>> and I know the schema should be:
>>> StructType(Array(StructField("customerid",LongType,true),StructField("foo",StringType,true)))
>>>
>>> What's the best way of solving this?
>>>
>>> My current approach is to iterate over the JSON, identify which fields
>>> are numbers and which aren't, then recreate the JSON.
>>>
>>> But to be honest that doesn't seem like the cleanest approach, so I'm
>>> happy for advice on this.
>>>
>>> Regards
>>> Sam
>>>
>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust <mich...@databricks.com>
>>> wrote:
>>>
>>> -dev
>>>
>>> You can use withColumn to change the type after the data has been loaded
>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>
>>> .
>>>
>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Hi Dirceu
>>>
>>> Thanks, you're right!
>>> That did work!
>>>
>>> But now I'm facing an even bigger problem, since I don't have access to
>>> change the underlying data. I just want to apply a schema over something
>>> that was written via sparkContext.newAPIHadoopRDD.
>>>
>>> Basically I am reading in an RDD[JsonObject] and would like to convert
>>> it into a dataframe which I pass the schema into.
>>>
>>> What's the best way to do this?
>>>
>>> I doubt removing all the quotes in the JSON is the best solution, is it?
>>>
>>> Regards
>>> Sam
>>>
>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho <
>>> dirceu.semigh...@gmail.com> wrote:
>>>
>>> Hi Sam
>>> Remove the " from the number and it will work.
>>>
>>> On Feb 4, 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>> Hi All
>>>
>>> I would like to specify a schema when reading from a JSON, but when
>>> trying to map a number to a Double it fails; I tried FloatType and IntType
>>> with no joy!
>>>
>>> When inferring the schema, customerid is set to String, and I would like
>>> to cast it as Double, so df1 comes out corrupted while df2 shows the data
>>> fine.
>>>
>>> Also FYI, I need this to be generic, as I would like to apply it to any
>>> JSON; I specified the below schema as an example of the issue I am facing.
>>>
>>> import org.apache.spark.sql.types.{BinaryType, StringType, StructField,
>>> DoubleType, FloatType, StructType, LongType, DecimalType}
>>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>>> val df1 =
>>> spark.read.schema(testSchema).json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> val df2 =
>>> spark.read.json(sc.parallelize(Array("""{"customerid":"535137"}""")))
>>> df1.show(1)
>>> df2.show(1)
>>>
>>> Any help would be appreciated. I am sure I am missing something obvious,
>>> but for the life of me I can't tell what it is!
>>>
>>> Kind Regards
>>> Sam
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>
> --
> Best Regards,
> Ayan Guha
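Michael's warning about the regex route is easy to demonstrate: a quote-stripping regex has no schema, so it also unquotes string fields whose values merely look like numbers. A small stdlib sketch (the `zip` field is a made-up example of a digits-only value that must remain a string):

```python
import json
import re

raw = '[{"customerid": "535137", "zip": "90210"}]'

# Naively strip the quotes around anything that looks like an integer.
naive = re.sub(r'"(\d+)"', r'\1', raw)
rows = json.loads(naive)

# customerid is now a number, as intended -- but so is zip, which should
# have stayed a string. Escaped quotes and digits embedded in longer
# strings add further corner cases; casting via the schema avoids them all.
```

This is exactly the "corner cases right (i.e. escaping and whatnot)" problem: only the schema can say which digit strings are actually numbers.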