How to overwrite PySpark DataFrame schema without data scan?

2022-04-12 Thread Rafał Wojdyła
Hello, Anyone has any comment or ideas regarding: https://stackoverflow.com/questions/71610435/how-to-overwrite-pyspark-dataframe-schema-without-data-scan please? Cheers - Rafal

Re: How to change Dataframe schema

2020-05-16 Thread Adi Polak
> ... different schema for that dataframe. Column names will be the same, but the data or schema may contain some extra columns. Is there any way I can apply the schema on top of the existing DataFrame? The schema may just be reordering the columns in most of the cases. ...

How to change Dataframe schema

2020-05-16 Thread Manjunath Shetty H
apply the schema on top of the existing DataFrame? The schema may just be reordering the columns in most of the cases. I have tried this: DataFrame dfNew = hc.createDataFrame(df.rdd(), ((StructType) DataType.fromJson(schema))); but this will map the columns base...
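
Since createDataFrame(rdd, schema) binds columns to fields by position rather than by name, one common alternative is to select the target columns by name, in schema order, casting where needed. A Scala sketch, not from the thread; conform is an illustrative helper, and it assumes every field in the target schema already exists in the source DataFrame:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.StructType

    // Reorder and cast the existing columns to match the target schema,
    // binding by name instead of by position.
    def conform(df: DataFrame, targetSchema: StructType): DataFrame =
      df.select(targetSchema.fields.map(f => col(f.name).cast(f.dataType)): _*)

Note this adjusts names, order, and types, but not the nullable flags, which Spark manages itself.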

[Spark SQL]: DataFrame schema resulting in NullPointerException

2017-11-19 Thread Chitral Verma
Hey, I'm working on this use case that involves converting DStreams to DataFrames after some transformations. I've simplified my code into the following snippet to reproduce the error, and my environment settings are below. Environment: Spark version 2.2.0, Java 1.8
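
Without the full snippet it is hard to say more, but one frequent source of NullPointerExceptions in this pattern is closing over a SparkSession or SQLContext reference inside the streaming code; the streaming guide recommends obtaining the session lazily per batch. A hedged Scala sketch, assuming a DStream[String] named stream:

    import org.apache.spark.sql.SparkSession

    stream.foreachRDD { rdd =>
      // Look the session up for each micro-batch instead of capturing a
      // reference that may be null after recovery from a checkpoint.
      val spark = SparkSession.builder().config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.toDF("value").show()
    }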

Re: Dataframe schema...

2016-10-26 Thread Michael Armbrust
On Fri, Oct 21, 2016 at 8:40 PM, Koert Kuipers wrote: > This rather innocent-looking optimization flag nullable has caused a lot of bugs... Makes me wonder if we are better off without it. Yes... my most regretted design decision :( Please give thoughts here:

Re: Dataframe schema...

2016-10-21 Thread Koert Kuipers
This rather innocent-looking optimization flag nullable has caused a lot of bugs... Makes me wonder if we are better off without it. On Oct 21, 2016 8:37 PM, "Muthu Jayakumar" wrote: > Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0. Thanks, Muthu

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
Thanks Cheng Lian for opening the JIRA. I found this with Spark 2.0.0. Thanks, Muthu On Fri, Oct 21, 2016 at 3:30 PM, Cheng Lian wrote: > Yea, confirmed. While analyzing unions, we treat StructTypes with different field nullabilities as incompatible types and throw this

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Yea, confirmed. While analyzing unions, we treat StructTypes with different field nullabilities as incompatible types and throw this error. Opened https://issues.apache.org/jira/browse/SPARK-18058 to track this issue. Thanks for reporting! Cheng On 10/21/16 3:15 PM, Cheng Lian wrote: Hi

Re: Dataframe schema...

2016-10-21 Thread Cheng Lian
Hi Muthu, what version of Spark are you using? This seems to be a bug in the analysis phase. Cheng On 10/21/16 12:50 PM, Muthu Jayakumar wrote: Sorry for the late response. Here is what I am seeing... Schema from the parquet file: d1.printSchema() root |-- task_id: string (nullable =

Re: Dataframe schema...

2016-10-21 Thread Muthu Jayakumar
Sorry for the late response. Here is what I am seeing... Schema from the parquet file, d1.printSchema():

    root
     |-- task_id: string (nullable = true)
     |-- task_name: string (nullable = true)
     |-- some_histogram: struct (nullable = true)
     |    |-- values: array (nullable = true)
     |    |    |--

Re: Dataframe schema...

2016-10-20 Thread Michael Armbrust
What is the issue you see when unioning? On Wed, Oct 19, 2016 at 6:39 PM, Muthu Jayakumar wrote: > Hello Michael, thank you for looking into this query. In my case there seems to be an issue when I union a parquet file read from disk versus another dataframe that I

Re: Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
Hello Michael, thank you for looking into this query. In my case there seems to be an issue when I union a parquet file read from disk with another dataframe that I construct in memory. The only difference I see is the containsNull = true. In fact, I do not see any errors with union on the

Re: Dataframe schema...

2016-10-19 Thread Michael Armbrust
Nullable is just a hint to the optimizer that it's impossible for there to be a null value in this column, so that it can avoid generating code for null checks. When in doubt, we set nullable = true, since it is always safer to check. Why in particular are you trying to change the nullability of the
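
When the nullabilities do need to be aligned, for instance to work around the union issue in this thread, one common trick is to reattach a relaxed copy of the schema. A minimal Scala sketch, not from the thread, that handles only flat schemas; nested struct and array fields would need a recursive rewrite:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.StructType

    // Mark every top-level field as nullable, then rebuild the DataFrame
    // over the same rows; nothing is scanned or copied eagerly.
    def asNullable(df: DataFrame): DataFrame = {
      val relaxed = StructType(df.schema.map(_.copy(nullable = true)))
      df.sparkSession.createDataFrame(df.rdd, relaxed)
    }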

Dataframe schema...

2016-10-19 Thread Muthu Jayakumar
Hello there, I am trying to understand how and when a DataFrame (or Dataset) sets nullable = true vs. false on a schema. Here is my observation from a sample code I tried... scala> spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d), (3, "c", 2.0d))).toDF("col1", "col2",
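
For context, the rule at work here: when Spark derives an encoder from Scala types, columns backed by primitives (Int, Double) come out as nullable = false, while reference types (String) come out as nullable = true. A sketch of the expected behavior on Spark 2.x:

    import spark.implicits._  // pre-imported in the shell

    val df = spark.createDataset(Seq((1, "a", 2.0d), (2, "b", 2.0d)))
      .toDF("col1", "col2", "col3")
    df.printSchema()
    // root
    //  |-- col1: integer (nullable = false)
    //  |-- col2: string (nullable = true)
    //  |-- col3: double (nullable = false)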

Creating a New Cassandra Table From a DataFrame Schema

2016-04-12 Thread Prateek .
Hi, I am trying to create a new Cassandra table by inferring the schema from JSON: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md I am not able to get the createCassandraTable function on DataFrame: import com.datastax.spark.connector._ df.createCassandraTable(
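
For what it's worth, createCassandraTable is added to DataFrame via an implicit in the com.datastax.spark.connector package object, so the import must be in scope and the connector version must be recent enough to include DataFrame support. A sketch along the lines of the linked documentation, with illustrative keyspace, table, and key names; treat the exact signature as an assumption and check it against your connector version:

    import com.datastax.spark.connector._  // brings the implicit DataFrame extensions into scope

    // Creates the table from the DataFrame's schema; it must not already exist.
    df.createCassandraTable(
      "my_keyspace",
      "my_table",
      partitionKeyColumns = Some(Seq("id")))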

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
For compatibility reasons, we always write data out as nullable in parquet. Given that that bit is only an optimization that we don't actually make much use of, I'm curious why you are worried that it's changing to true? On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam wrote: >

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Let me share my 2 cents. First, this is not documented in the official documentation. Maybe we should do it? http://spark.apache.org/docs/latest/sql-programming-guide.html Second, nullability is a significant concept for database people. It is part of the schema. Extra code is needed for evaluating

Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Jerry Lam
Hi Spark users and developers, I have a dataframe with the following schema (Spark 1.5.1): StructType(StructField(type,StringType,true), StructField(timestamp,LongType,false)) After I save the dataframe in parquet and read it back, I get the following schema:
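
To make the round trip concrete, a small Scala sketch, not from the thread, written against the current SparkSession API; the path is illustrative:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("type", StringType, nullable = true),
      StructField("timestamp", LongType, nullable = false)))

    val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)
    df.write.parquet("/tmp/preserve_schema_demo")

    // As Michael notes in his reply above, Spark writes every Parquet
    // column as nullable, so timestamp comes back as nullable = true.
    spark.read.parquet("/tmp/preserve_schema_demo").printSchema()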

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Michael Armbrust
> First, this is not documented in the official documentation. Maybe we should do it? http://spark.apache.org/docs/latest/sql-programming-guide.html Pull requests welcome. > Second, nullability is a significant concept for database people. It is part of the schema. Extra code is needed for

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Xiao Li
Sure. Will try to do a pull request this week. Schema evolution is always painful for database people. IMO, NULL was a bad design decision in the original System R. It introduces a lot of problems during system migration and data integration. Let me give a possible scenario: an RDBMS is used as an ODS.

Re: Spark SQL: Preserving Dataframe Schema

2015-10-20 Thread Richard Hillegas
> Sure. Will try to do a pull request this week.

How to get a clean DataFrame schema merge

2015-04-15 Thread Jaonary Rabarisoa
Hi all, if you follow the example of schema merging in the Spark documentation http://spark.apache.org/docs/latest/sql-programming-guide.html#schema-merging you obtain the following results when you load the merged data:

    single  triple  double
    1       3       null
    2       6       null
    4
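
For reference, that result comes from reading the parent directory with schema merging enabled, as in the linked guide; a sketch using the guide's illustrative path and the current SparkSession API:

    // Yields the union of the per-partition schemas, with nulls where a
    // partition lacks a column, which is the table shown above.
    val merged = spark.read.option("mergeSchema", "true").parquet("data/test_table")
    merged.show()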

Re: How to get a clean DataFrame schema merge

2015-04-15 Thread Michael Armbrust
Schema merging is not the feature you are looking for. It is designed for when you are adding new records (which are not associated with old records) that may or may not have new or missing columns. In your case it looks like you have two datasets that you want to load separately and join on a key.
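
A sketch of that suggestion, reusing the guide's illustrative layout and assuming "single" is the shared key:

    // Load each dataset on its own, then join on the key instead of
    // relying on schema merging.
    val doubles = spark.read.parquet("data/test_table/key=1") // columns: single, double
    val triples = spark.read.parquet("data/test_table/key=2") // columns: single, triple
    val joined  = doubles.join(triples, Seq("single"), "outer")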

Re: How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
I forgot to mention that the imageId field is a custom Scala object. Do I need to implement some special methods (equals, hashCode) to make it work? On Tue, Apr 14, 2015 at 5:00 PM, Jaonary Rabarisoa jaon...@gmail.com wrote: Dear all, in the latest version of Spark there's a feature called:

How DataFrame schema migration works ?

2015-04-14 Thread Jaonary Rabarisoa
Dear all, in the latest version of Spark there's a feature called automatic partition discovery and schema migration for Parquet. As far as I know, this gives the ability to split the DataFrame into several Parquet files, and by just loading the parent directory one can get the global schema of
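
The feature keys off the directory layout; a sketch of the documented pattern, with illustrative paths and the current SparkSession API:

    // Two partitions written with different but overlapping schemas:
    //   data/table/key=1/ -> columns: single, double
    //   data/table/key=2/ -> columns: single, triple
    // Loading the parent directory discovers "key" as a partition column
    // and, with mergeSchema, reconciles the per-partition schemas into
    // one global schema: single, double, triple, key.
    val global = spark.read.option("mergeSchema", "true").parquet("data/table")
    global.printSchema()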