Ah ok
Thanks for clearing it up, Ayan! I will give that a go. Thank you all for
your help, this mailing list is awesome!

On Mon, Feb 6, 2017 at 9:07 AM, ayan guha <guha.a...@gmail.com> wrote:

> If I am not missing anything here, "So I know which columns are numeric
> and which aren't because I have a StructType and all the internal
> StructFields will tell me which ones have a DataType which is numeric
> and which aren't" will lead to a list of fields which should be numeric.
>
> Essentially, you will create a list of numeric fields from your
> "should-be" struct type. Then you will load your raw data using the
> built-in JSON reader. At this point, your data has the wrong schema.
> Now you will need to correct it. How? You will loop over the list of
> numeric fields (or you can do it directly on the struct type) and try
> to match the types. If you find a mismatch, you add a withColumn clause
> to cast to the correct data type (from your "should-be" struct).
>
> HTH?
>
> Best
> Ayan
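To make sure I actually understood that recipe, here is a minimal sketch
of it (untested, in Scala; `expectedSchema` is my name for the
"should-be" StructType, and both helper names are made up):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{NumericType, StructField, StructType}

// Derive the list of fields that should be numeric from the expected
// ("should-be") schema.
def numericFields(expectedSchema: StructType): Seq[StructField] =
  expectedSchema.fields.filter(_.dataType.isInstanceOf[NumericType]).toSeq

// Cast each such column (loaded as a string) to the type the expected
// schema demands, building up a new dataframe as Ayan describes.
def castToExpected(df: DataFrame, expectedSchema: StructType): DataFrame =
  numericFields(expectedSchema).foldLeft(df) { (acc, field) =>
    acc.withColumn(field.name, acc(field.name).cast(field.dataType))
  }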
> On Mon, Feb 6, 2017 at 8:00 PM, Sam Elamin <hussam.ela...@gmail.com>
> wrote:
>
>> Yup, sorry, I should have explained myself better.
>>
>> So I know which columns are numeric and which aren't because I have a
>> StructType, and all the internal StructFields will tell me which ones
>> have a DataType which is numeric and which aren't.
>>
>> So assuming I have a JSON string which has double quotes on numbers
>> when it shouldn't, and I have the correct schema in a StructType, how
>> can I iterate over them to programmatically create the new dataframe
>> in the correct format?
>>
>> Do I iterate over the columns in the StructType, or iterate over the
>> columns in the dataframe and try to match them with the StructType?
>>
>> I hope I cleared things up. What I wouldn't do for a drawing board
>> right now!
>>
>> On Mon, Feb 6, 2017 at 8:56 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Umm... I think the premise is you need to "know" beforehand which
>>> columns are numeric... Unless you know it, how would you apply the
>>> schema?
>>>
>>> On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin <hussam.ela...@gmail.com>
>>> wrote:
>>>
>>>> Thanks Ayan, but I meant how to derive the list automatically.
>>>>
>>>> In your example you are specifying the numeric columns, and I would
>>>> like it to be applied to any schema, if that makes sense.
>>>>
>>>> On Mon, 6 Feb 2017 at 08:49, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> Simple (pyspark) example:
>>>>>
>>>>> >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json")
>>>>> >>> df.printSchema()
>>>>> root
>>>>>  |-- customerid: string (nullable = true)
>>>>>  |-- foo: string (nullable = true)
>>>>>
>>>>> >>> numeric_field_list = ['customerid']
>>>>>
>>>>> >>> for k in numeric_field_list:
>>>>> ...     df = df.withColumn(k, df[k].cast("long"))
>>>>> ...
>>>>> >>> df.printSchema()
>>>>> root
>>>>>  |-- customerid: long (nullable = true)
>>>>>  |-- foo: string (nullable = true)
>>>>>
>>>>> On Mon, Feb 6, 2017 at 6:56 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Ok, thanks Michael!
>>>>>
>>>>> Can I get an idea on where to start? Assuming I have the end schema
>>>>> and the current dataframe, how can I loop through it and create a
>>>>> new dataframe using withColumn?
>>>>>
>>>>> Am I iterating through the dataframe or the schema?
>>>>>
>>>>> I'm assuming it's easier to iterate through the columns in the old
>>>>> df, and for each column cast it correctly and generate a new df?
>>>>>
>>>>> Would you recommend that?
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Mon, 6 Feb 2017 at 01:12, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>> If you already have the expected schema, and you know that all
>>>>> numbers will always be formatted as strings in the input JSON, you
>>>>> could probably derive this list automatically.
>>>>>
>>>>> Wouldn't it be simpler to just regex replace the numbers to remove
>>>>> the quotes?
>>>>>
>>>>> I think this is likely to be a slower and less robust solution. You
>>>>> would have to make sure that you got all the corner cases right
>>>>> (i.e. escaping and whatnot).
>>>>>
>>>>> On Sun, Feb 5, 2017 at 3:13 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> I see, so for the connector I need to pass in an array/list of
>>>>> numerical columns?
>>>>>
>>>>> Wouldn't it be simpler to just regex replace the numbers to remove
>>>>> the quotes?
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Sun, Feb 5, 2017 at 11:11 PM, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>> Specifying the schema when parsing JSON will only let you pick
>>>>> between similar datatypes (i.e. should this be a short, long,
>>>>> float, double, etc). It will not let you perform conversions like
>>>>> string <-> number. This has to be done with explicit casts after
>>>>> the data has been loaded.
>>>>>
>>>>> I think you can make a solution that uses select or withColumn
>>>>> generic. Just load the dataframe with a "parse schema" that treats
>>>>> numbers as strings. Then construct a list of columns that should be
>>>>> numbers and apply the necessary conversions.
>>>>>
>>>>> import org.apache.spark.sql.functions.col
>>>>> var df = spark.read.schema(parseSchema).json("...")
>>>>> numericColumns.foreach { columnName =>
>>>>>   df = df.withColumn(columnName, col(columnName).cast("long"))
>>>>> }
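Michael's "parse schema" above can itself be derived from the expected
schema rather than hand-written, which is the generic behaviour I am
after. Again only a sketch (same assumed `expectedSchema`; it walks
top-level fields only, so nested structs would need recursion):

import org.apache.spark.sql.types.{NumericType, StringType, StructType}

// Downgrade every numeric field to a string so the quoted numbers in
// the JSON load without being nulled out.
def parseSchemaFor(expectedSchema: StructType): StructType =
  StructType(expectedSchema.fields.map { f =>
    if (f.dataType.isInstanceOf[NumericType]) f.copy(dataType = StringType)
    else f
  })

// Load with the derived parse schema, then cast back with the
// castToExpected helper sketched earlier.
val raw = spark.read.schema(parseSchemaFor(expectedSchema)).json("...")
val fixed = castToExpected(raw, expectedSchema)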
>>>>>
>>>>> On Sun, Feb 5, 2017 at 2:09 PM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Thanks Michael,
>>>>>
>>>>> I've been spending the past few days researching this.
>>>>>
>>>>> The problem is that the generated JSON has double quotes on fields
>>>>> that are numbers, because the producing datastore doesn't want to
>>>>> lose precision.
>>>>>
>>>>> I can change the data type, true, but that would be specific to one
>>>>> job rather than generic. I'm writing a structured streaming
>>>>> connector, and I have the schema the generated dataframe should
>>>>> match.
>>>>>
>>>>> Unfortunately using withColumn won't help me here, since the
>>>>> solution needs to be generic.
>>>>>
>>>>> To summarise, assume I have the following JSON:
>>>>>
>>>>> [{
>>>>>   "customerid": "535137",
>>>>>   "foo": "bar"
>>>>> }]
>>>>>
>>>>> and I know the schema should be:
>>>>>
>>>>> StructType(Array(StructField("customerid",LongType,true),
>>>>>   StructField("foo",StringType,true)))
>>>>>
>>>>> What's the best way of solving this?
>>>>>
>>>>> My current approach is to iterate over the JSON and identify which
>>>>> fields are numbers and which aren't, then recreate the JSON.
>>>>>
>>>>> But to be honest that doesn't seem like the cleanest approach, so
>>>>> I'm happy for advice on this.
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Sun, 5 Feb 2017 at 22:00, Michael Armbrust
>>>>> <mich...@databricks.com> wrote:
>>>>>
>>>>> -dev
>>>>>
>>>>> You can use withColumn to change the type after the data has been
>>>>> loaded
>>>>> <https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1023043053387187/1572067047091340/2840265927289860/latest.html>.
>>>>>
>>>>> On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi Dirceu,
>>>>>
>>>>> Thanks, you're right! That did work.
>>>>>
>>>>> But now I'm facing an even bigger problem: since I don't have
>>>>> access to change the underlying data, I just want to apply a schema
>>>>> over something that was written via sparkContext.newAPIHadoopRDD.
>>>>>
>>>>> Basically I am reading in an RDD[JsonObject] and would like to
>>>>> convert it into a dataframe which I pass the schema into.
>>>>>
>>>>> What's the best way to do this?
>>>>>
>>>>> I doubt removing all the quotes in the JSON is the best solution,
>>>>> is it?
>>>>>
>>>>> Regards
>>>>> Sam
>>>>>
>>>>> On Sat, Feb 4, 2017 at 2:13 PM, Dirceu Semighini Filho
>>>>> <dirceu.semigh...@gmail.com> wrote:
>>>>>
>>>>> Hi Sam,
>>>>> Remove the " from the number and it will work.
>>>>>
>>>>> On 4 Feb 2017 at 11:46 AM, "Sam Elamin" <hussam.ela...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I would like to specify a schema when reading from a JSON, but when
>>>>> trying to map a number to a Double it fails. I tried FloatType and
>>>>> IntType with no joy!
>>>>>
>>>>> When inferring the schema, customerid is set to String, and I would
>>>>> like to cast it as Double, so df1 below is corrupted while df2
>>>>> displays fine.
>>>>>
>>>>> Also, FYI, I need this to be generic, as I would like to apply it
>>>>> to any JSON; I specified the below schema as an example of the
>>>>> issue I am facing:
>>>>>
>>>>> import org.apache.spark.sql.types.{BinaryType, StringType,
>>>>>   StructField, DoubleType, FloatType, StructType, LongType,
>>>>>   DecimalType}
>>>>> val testSchema = StructType(Array(StructField("customerid", DoubleType)))
>>>>> val df1 = spark.read.schema(testSchema).json(
>>>>>   sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>>> val df2 = spark.read.json(
>>>>>   sc.parallelize(Array("""{"customerid":"535137"}""")))
>>>>> df1.show(1)
>>>>> df2.show(1)
>>>>>
>>>>> Any help would be appreciated. I am sure I am missing something
>>>>> obvious, but for the life of me I can't tell what it is!
>>>>>
>>>>> Kind Regards
>>>>> Sam
>
> --
> Best Regards,
> Ayan Guha
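PS: for my earlier RDD[JsonObject] question further up the thread, I
think the same two helpers apply once the objects are rendered back to
JSON strings. A sketch, assuming gson's JsonObject (whose toString emits
JSON text) and the parseSchemaFor/castToExpected helpers sketched
earlier:

import org.apache.spark.rdd.RDD

// `rdd` stands for the hypothetical RDD[JsonObject] that came out of
// sparkContext.newAPIHadoopRDD.
val jsonStrings: RDD[String] = rdd.map(_.toString)
val parsed = spark.read.schema(parseSchemaFor(expectedSchema)).json(jsonStrings)
val df = castToExpected(parsed, expectedSchema)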