Proposed additional function to create fold_column for better integration of Spark data frames with H2O

2022-01-06 Thread Chester Gan
Idea: PySpark function to create fold indices (numbers from 0, ..., N-1, where N := number of folds needed for k-fold CV during AutoML training) on the train & test datasets:
```
# train & test are PySpark dataframes of the train & test datasets respectively
import pyspark.sql.functions as F
from
```
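The message is truncated in the archive; below is a minimal sketch of what such a helper could look like, assuming a stable unique-id column and hash-based fold assignment. The function name `with_fold_column` and its signature are illustrative, not from the original post:

```
# Illustrative sketch only: assigns each row a deterministic fold index
# in [0, n_folds) by hashing a stable id column, so the same row always
# lands in the same fold across runs.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession

def with_fold_column(df: DataFrame, id_col: str, n_folds: int,
                     fold_col: str = "fold_column") -> DataFrame:
    # pmod keeps the result non-negative even when hash() is negative
    return df.withColumn(fold_col, F.expr(f"pmod(hash({id_col}), {n_folds})"))

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame([(i, float(i)) for i in range(10)], ["id", "x"])
with_fold_column(train, "id", n_folds=5).show()
```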

Re: Spark 3.2 - ReusedExchange not present in join execution plan

2022-01-06 Thread Abdeali Kothari
Thanks a lot for the reply Albert. On looking at it and reading about it further, I do see that "AdaptiveSparkPlan isFinalPlan=false" is mentioned. Could you point me to how I can see the final plan? I couldn't find that in any of the resources I was referring to. On Fri, 7 Jan 2022, 07:25

Re: Spark 3.2 - ReusedExchange not present in join execution plan

2022-01-06 Thread Albert
I happened to encounter something similar. It's probably because you are just `explain`ing it; when you actually run it, you will get the final Spark plan, in which case the exchange will be reused. Right, this is different compared with 3.1, probably because of the upgraded AQE. Not sure whether this
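A small sketch of what Albert describes, assuming AQE is enabled (the default in Spark 3.2): `explain()` before any action prints `AdaptiveSparkPlan isFinalPlan=false`, while calling `explain()` again after executing the query prints the finalized plan, where reused exchanges show up. The query itself is just a toy example:

```
# Sketch: observing the final AQE plan by explaining after execution.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.range(1000).withColumnRenamed("id", "k")
b = spark.range(1000).withColumnRenamed("id", "k")
# The same join subplan appears twice, so its exchange can be reused.
q = a.join(b, "k").union(a.join(b, "k"))

q.explain()   # AdaptiveSparkPlan isFinalPlan=false (not yet executed)
q.collect()   # run the query so AQE can finalize the plan
q.explain()   # isFinalPlan=true; ReusedExchange appears here if applicable
```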

How to add a row number column without reordering my data frame

2022-01-06 Thread Andrew Davidson
Hi, I am trying to work through an OOM error. I have 10411 files. I want to select a single column from each file and then join them into a single table. The files have a unique row id. However, it is a very long string. The data file with just the name and column of interest is about 470 MB. The
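Not from the thread, but two standard ways to attach a row index without a global sort are sketched below. Note that `monotonically_increasing_id` produces stable but non-consecutive ids, so it only works as a join key across files if every file is partitioned and indexed the same way. The input path is hypothetical:

```
# Sketch: adding a row index while preserving the existing row order.
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("sample_quants.csv", header=True)  # hypothetical file

# Option 1: unique, order-preserving ids with gaps between partitions.
with_id = df.withColumn("row_id", F.monotonically_increasing_id())

# Option 2: consecutive 0..N-1 ids via the RDD API (one extra pass).
rdd = df.rdd.zipWithIndex().map(lambda pair: tuple(pair[0]) + (pair[1],))
consecutive = spark.createDataFrame(rdd, df.schema.add("row_id", "long"))
```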

Fwd: metastore bug when Hive updates Spark table?

2022-01-06 Thread Mich Talebzadeh
From my experience this is a Spark issue (more code-base divergence of spark-sql from Hive), but of course there is the work-around as below. -- Forwarded message - From: Mich Talebzadeh Date: Thu, 6 Jan 2022 at 17:29 Subject: Re: metastore bug when Hive updates Spark table?

Spark metadata metastore bug?

2022-01-06 Thread Nicolas Paris
Spark can't see Hive schema updates, partly because it stores the schema in a weird way in the Hive metastore. 1. FROM SPARK: create a table:
```
>>> spark.sql("select 1 col1, 2 col2").write.format("parquet").saveAsTable("my_table")
>>> spark.table("my_table").printSchema()
root
 |--
```
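A hedged sketch of what Nicolas is pointing at: for tables written through the DataFrame API, Spark serializes its own copy of the schema into table properties (`spark.sql.sources.schema.*`) in the metastore and reads that back, which is why Hive-side column changes can go unseen. This reuses the `my_table` created in the excerpt above and assumes a Hive-enabled session:

```
# Sketch: inspecting the Spark-managed schema copy in the metastore.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Spark serializes its schema as JSON into the table properties:
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)

# After an ALTER TABLE issued from the Hive side, Spark may still print
# the old schema, since it reads back its stored copy rather than the
# Hive column metadata.
spark.table("my_table").printSchema()
```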

Re: JDBCConnectionProvider in Spark

2022-01-06 Thread Sean Owen
They're in core/ under org.apache.spark.sql.execution.datasources.jdbc.connection. I don't quite understand; it's an abstraction over lots of concrete implementations, just simple software design here. You can implement your own provider too, I suppose. On Thu, Jan 6, 2022 at 8:22 AM Artemis User

Re: JDBCConnectionProvider in Spark

2022-01-06 Thread Artemis User
The only example I saw in the Spark distribution was the ExampleJdbcConnectionProvider file in the examples directory. It basically just wraps the abstract class with overriding methods. I guess my question was: since Spark embeds the JDBC APIs in the DataFrame reader and writer, why such

Re: JDBCConnectionProvider in Spark

2022-01-06 Thread Sean Owen
There are 8 concrete implementations of it? OracleConnectionProvider, etc. On Wed, Jan 5, 2022 at 9:26 PM Artemis User wrote: > Could someone provide some insight/examples on the usage of this API?

Re: pyspark

2022-01-06 Thread Gourav Sengupta
Hi, I am not sure at all that we need to use SQLContext and HiveContext anymore. Can you please check your JAVA_HOME and SPARK_HOME? I use the findspark library to set up all Spark-related environment variables for me, or use conda to install pyspark from conda-forge. Regards, Gourav Sengupta
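For reference, a sketch of the current idiom: since Spark 2.x, `SparkSession` subsumes both `SQLContext` and `HiveContext`, and `findspark.init()` can locate the Spark installation before `pyspark` is imported:

```
# Sketch: modern PySpark entry point, no SQLContext/HiveContext needed.
import findspark
findspark.init()  # resolves SPARK_HOME and adds pyspark to sys.path

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example")
         .enableHiveSupport()  # covers what HiveContext used to do
         .getOrCreate())
spark.sql("select 1 as one").show()
```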