Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
DataFrame view stores the logical plan, while SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to d...@spark.apache.org (and join that group) > >
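
A minimal sketch of the contrast Wenchen draws, assuming an active Hive-enabled session (shown in Scala; the PySpark calls are analogous, and all table/view names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
    val df = spark.range(10).toDF("id")

    // A temp view stores df's logical plan and lives only in this session.
    df.createOrReplaceTempView("my_temp_view")

    // A permanent view stores SQL text in the metastore, which is why it must be
    // defined from SQL over existing tables rather than from an arbitrary DataFrame.
    spark.sql("CREATE TABLE IF NOT EXISTS src USING parquet AS SELECT * FROM my_temp_view")
    spark.sql("CREATE VIEW IF NOT EXISTS my_perm_view AS SELECT id FROM src WHERE id > 0")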

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Wenchen Fan
Thanks, Chao! On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > We are happy to announce the availability of Apache Spark 3.2.3! > > Spark 3.2.3 is a maintenance release containing stability fixes. This > release is based on the branch-3.2 maintenance branch of Spark. We strongly > recommend

Re: Difference in behavior for Spark 3.0 vs Spark 3.1 "create database"

2022-01-11 Thread Wenchen Fan
Hopefully, this StackOverflow answer can solve your problem: https://stackoverflow.com/questions/47523037/how-do-i-configure-pyspark-to-write-to-hdfs-by-default Spark doesn't control the behavior of qualifying paths. It's decided by certain configs and/or config files. On Tue, Jan 11, 2022 at
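
A sketch of the configs involved, assuming Spark with Hive support; the warehouse URI below is hypothetical:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")
      .enableHiveSupport()
      .getOrCreate()

    // The new database's location resolves under the configured warehouse dir; if
    // the config carries no scheme, Hadoop's fs.defaultFS (core-site.xml) decides it.
    spark.sql("CREATE DATABASE IF NOT EXISTS mydb")
    spark.sql("DESCRIBE DATABASE mydb").show(truncate = false)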

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Wenchen Fan
Yea the file naming is a bit confusing, we can fix it in the next release. 3.2 actually means 3.2 or higher, so not a big deal I think. Congrats and thanks! On Wed, Oct 20, 2021 at 3:44 AM Jungtaek Lim wrote: > Thanks to Gengliang for driving this huge release! > > On Wed, Oct 20, 2021 at 1:50

Re: About Spark executing SQL scripts

2021-05-24 Thread Wenchen Fan
It's not possible to load everything into memory. We should use a BigQuery connector (one should already exist?) and register tables B and C as temp views in Spark. On Fri, May 14, 2021 at 8:50 AM bo zhao wrote: > Hi Team, > > I've followed Spark community for several years. This is my first
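
A sketch of that approach, assuming an active `spark` session and an external connector (e.g. the spark-bigquery connector) on the classpath; format name, table ids, and columns are illustrative:

    val tableB = spark.read.format("bigquery").option("table", "project.dataset.B").load()
    val tableC = spark.read.format("bigquery").option("table", "project.dataset.C").load()
    tableB.createOrReplaceTempView("B")
    tableC.createOrReplaceTempView("C")

    // The script's SQL can now run in Spark SQL without loading everything into memory:
    val result = spark.sql("SELECT B.id, C.value FROM B JOIN C ON B.id = C.id")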

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-11 Thread Wenchen Fan
client APIs to the end users >> in this approach? The users can only call backup or restore, right? >> >> Thanks, >> Tianchen >> >> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan wrote: >> >>> If a catalog implements backup/restore, it can easily expose some

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-07 Thread Wenchen Fan
If a catalog implements backup/restore, it can easily expose some client APIs to the end-users (e.g. REST API), I don't see a strong reason to expose the APIs to Spark. Do you plan to add new SQL commands in Spark to backup/restore a catalog? On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang wrote:

Re: Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Wenchen Fan
Hi Tomas, thanks for reporting this bug! Is it possible to share your dataset so that other people can reproduce and debug it? On Thu, Apr 8, 2021 at 7:52 AM Tomas Bartalos wrote: > when I try to do a Broadcast Hash Join on a bigger table (6Mil rows) I get > an incorrect result of 0 rows. > >
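
While such a report is being investigated, one possible mitigation (an assumption on my part, for Spark 3.x where the feature exists) is to disable dynamic partition pruning and re-check the results:

    // Turn off DPP for the session and see whether the wrong results disappear:
    spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "false")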

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Wenchen Fan
Great work and congrats! On Wed, Mar 3, 2021 at 3:51 PM Kent Yao wrote: > Congrats, all! > > Bests, > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > interface

Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Wenchen Fan
Great work, thanks, Ruifeng! On Fri, Sep 11, 2020 at 11:09 PM Gengliang Wang < gengliang.w...@databricks.com> wrote: > Congrats! > Thanks for the work, Ruifeng! > > > On Fri, Sep 11, 2020 at 9:51 PM Takeshi Yamamuro > wrote: > >> Congrats and thanks, Ruifeng! >> >> >> On Fri, Sep 11, 2020 at

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-23 Thread Wenchen Fan
*Regards,Dhrubajyoti Hati.* > > > On Thu, Apr 23, 2020 at 10:12 AM Wenchen Fan wrote: > >> This looks like a bug that path filter doesn't work for hive table >> reading. Can you open a JIRA ticket? >> >> On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati >

Re: Error while reading hive tables with tmp/hidden files inside partitions

2020-04-22 Thread Wenchen Fan
This looks like a bug that path filter doesn't work for hive table reading. Can you open a JIRA ticket? On Thu, Apr 23, 2020 at 3:15 AM Dhrubajyoti Hati wrote: > Just wondering if anyone could help me out on this. > > Thank you! > > > > > *Regards,Dhrubajyoti Hati.* > > > On Wed, Apr 22, 2020

Re: BUG: take with SparkSession.master[url]

2020-03-27 Thread Wenchen Fan
…from maven. > > Backbutton.co.uk > ¯\_(ツ)_/¯ > ♡۶Java♡۶RMI ♡۶ > Make Use Method {MUM} > makeuse.org > <http://www.backbutton.co.uk> > > > On Fri, 27 Mar 2020 at 05:45, Wenchen Fan wrote: > >> Which Spark/Scala version do you use? >

Re: BUG: take with SparkSession.master[url]

2020-03-26 Thread Wenchen Fan
Which Spark/Scala version do you use? On Fri, Mar 27, 2020 at 1:24 PM Zahid Rahman wrote: > > with the following sparksession configuration > > val spark = SparkSession.builder().master("local[*]").appName("Spark Session > take").getOrCreate(); > > this line works > > flights.filter(flight_row

Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-10 Thread Wenchen Fan
Great job, Dongjoon! On Mon, Feb 10, 2020 at 4:18 PM Hyukjin Kwon wrote: > Thanks Dongjoon! > > On Sun, Feb 9, 2020 at 10:49 AM, Takeshi Yamamuro wrote: > >> Happy to hear the release news! >> >> Bests, >> Takeshi >> >> On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun >> wrote: >> >>> There was a typo

Re: A question about RDD bytes size

2019-12-01 Thread Wenchen Fan
When we talk about bytes size, we need to specify how the data is stored. For example, if we cache the dataframe, then the bytes size is the number of bytes of the binary format of the table cache. If we write to hive tables, then the bytes size is the total size of the data files of the table.
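
A sketch of inspecting one of those sizes, the estimate for a cached DataFrame (API as in Spark 2.4+; assumes an active `spark` session):

    val df = spark.range(1000000L).toDF("id")
    df.cache()
    df.count()  // materialize the cache

    // Estimated bytes of the cached in-memory representation, not of any file on disk:
    println(df.queryExecution.optimizedPlan.stats.sizeInBytes)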

Re: [DISCUSS] Remove sorting of fields in PySpark SQL Row construction

2019-11-06 Thread Wenchen Fan
Sounds reasonable to me. We should make the behavior consistent within Spark. On Tue, Nov 5, 2019 at 6:29 AM Bryan Cutler wrote: > Currently, when a PySpark Row is created with keyword arguments, the > fields are sorted alphabetically. This has created a lot of confusion with > users because it

Re: Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
Ah sorry I made a mistake. "Spark can only pick BroadcastNestedLoopJoin to implement left/right join" this should be "left/right non-equal join" On Thu, Oct 24, 2019 at 6:32 AM zhangliyun wrote: > > Hi Herman: >I guess what you mentioned before > ``` > if you are OK with slightly different

Re: A question about broadcast nest loop join

2019-10-23 Thread Wenchen Fan
I haven't looked into your query yet, just want to let you know that: Spark can only pick BroadcastNestedLoopJoin to implement left/right join. If the table is very big, then OOM happens. Maybe there is an algorithm to implement left/right join in a distributed environment without broadcast, but
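
A minimal example of such a join (per the correction above, a non-equi left/right join): with no equality predicate there is nothing to hash or sort on, so Spark can only broadcast one side and loop:

    val left = spark.range(100).toDF("a")
    val right = spark.range(100).toDF("b")

    // A non-equi left outer join; when the broadcast side is big, this is where OOM happens.
    val joined = left.join(right, left("a") < right("b"), "left")
    joined.explain()  // the physical plan shows BroadcastNestedLoopJoin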

Re: [ANNOUNCE] Announcing Apache Spark 2.4.4

2019-09-01 Thread Wenchen Fan
Great! Thanks! On Mon, Sep 2, 2019 at 5:55 AM Dongjoon Hyun wrote: > We are happy to announce the availability of Spark 2.4.4! > > Spark 2.4.4 is a maintenance release containing stability fixes. This > release is based on the branch-2.4 maintenance branch of Spark. We strongly > recommend all

Re: JDK11 Support in Apache Spark

2019-08-25 Thread Wenchen Fan
Great work! On Sun, Aug 25, 2019 at 6:03 AM Xiao Li wrote: > Thank you for your contributions! This is a great feature for Spark > 3.0! We finally achieve it! > > Xiao > > On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung > wrote: > >> That’s great! >> >> -- >> *From:*

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Wenchen Fan
+1 On Wed, Aug 14, 2019 at 12:52 PM Holden Karau wrote: > +1 > Does anyone have any critical fixes they’d like to see in 2.4.4? > > On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > >> Seems fine to me if there are enough valuable fixes to justify another >> release. If there are any other

Re: Access to live data of cached dataFrame

2019-05-21 Thread Wenchen Fan
When you cache a dataframe, you actually cache a logical plan. That's why re-creating the dataframe doesn't work: Spark finds out the logical plan is cached and picks the cached data. You need to uncache the dataframe, or go back to the SQL way: df.createTempView("abc") spark.table("abc").cache()
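
A sketch of the behavior described, assuming an active `spark` session: the re-created DataFrame has an identical logical plan, so Spark serves it from the cache until it is uncached:

    import spark.implicits._

    val df = spark.range(100).toDF("id").filter($"id" > 50)
    df.cache()
    df.count()  // materializes the cache

    val df2 = spark.range(100).toDF("id").filter($"id" > 50)  // same logical plan
    df2.explain()  // shows InMemoryTableScan, i.e. it reads the cached data

    df.unpersist()  // drop the cache to see live data again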

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
… > > On Wed, Apr 24, 2019 at 6:24 PM Wenchen Fan wrote: > >> How did you read/write the timestamp value from/to ORC file? >> >> On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia < >> shubh.chaura...@gmail.com> wrote: >> >>> Hi Al

Re: DataFrameWriter does not adjust spark.sql.session.timeZone offset while writing orc files

2019-04-24 Thread Wenchen Fan
How did you read/write the timestamp value from/to ORC file? On Wed, Apr 24, 2019 at 6:30 PM Shubham Chaurasia wrote: > Hi All, > > Consider the following(spark v2.4.0): > > Basically I change values of `spark.sql.session.timeZone` and perform an > orc write. Here are 3 samples:- > > 1) >
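
A sketch of the scenario being asked about, writing and reading back a timestamp under different session time zones; the output path is hypothetical:

    import spark.implicits._

    spark.conf.set("spark.sql.session.timeZone", "UTC")
    val ts = Seq(java.sql.Timestamp.valueOf("2019-04-24 10:00:00")).toDF("ts")
    ts.write.mode("overwrite").orc("/tmp/ts_orc")

    // Timestamps are stored as instants; the session time zone affects how they
    // are rendered back as strings, not the stored value:
    spark.conf.set("spark.sql.session.timeZone", "Asia/Kolkata")
    spark.read.orc("/tmp/ts_orc").show(false)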

Re: Manually reading parquet files.

2019-03-22 Thread Wenchen Fan
Try `val enconder = RowEncoder(df.schema).resolveAndBind()` ? On Thu, Mar 21, 2019 at 5:39 PM Long, Andrew wrote: > Thanks a ton for the help! > > > > Is there a standardized way of converting the internal row to rows? > > > > I’ve tried this but im getting an exception > > > > *val *enconder =
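
A slightly fuller sketch of that suggestion against the Spark 2.x catalyst API (`fromRow` was replaced by `createDeserializer` in 3.0); `df` and `internalRow` are assumed to exist:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.encoders.RowEncoder

    val encoder = RowEncoder(df.schema).resolveAndBind()
    val row: Row = encoder.fromRow(internalRow)  // InternalRow -> external Row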

Re: spark sql error

2019-03-22 Thread Wenchen Fan
Did you include the whole error message? On Fri, Mar 22, 2019 at 12:45 AM 563280193 <563280...@qq.com> wrote: > Hi , > I ran a spark sql like this: > > *select imei,tag, product_id,* > * sum(case when succ1>=1 then 1 else 0 end) as succ,* > * sum(case when fail1>=1 and succ1=0 then 1

Re: [ANNOUNCE] Announcing Apache Spark 2.3.3

2019-02-18 Thread Wenchen Fan
great job! On Mon, Feb 18, 2019 at 4:24 PM Hyukjin Kwon wrote: > Yay! Good job Takeshi! > > On Mon, 18 Feb 2019, 14:47 Takeshi Yamamuro >> We are happy to announce the availability of Spark 2.3.3! >> >> Apache Spark 2.3.3 is a maintenance release, based on the branch-2.3 >> maintenance branch

Re: [ANNOUNCE] Announcing Apache Spark 2.4.0

2018-11-08 Thread Wenchen Fan
+ user list On Fri, Nov 9, 2018 at 2:20 AM Wenchen Fan wrote: > resend > > On Thu, Nov 8, 2018 at 11:02 PM Wenchen Fan wrote: > >> >> >> -- Forwarded message - >> From: Wenchen Fan >> Date: Thu, Nov 8, 2018 at 10:55 PM >> S

Re: BroadcastJoin failed on partitioned parquet table

2018-10-01 Thread Wenchen Fan
I'm not sure if Spark 1.6 is still maintained. Can you try a 2.x Spark version and see if the problem still exists? On Sun, Sep 30, 2018 at 4:14 PM 白也诗无敌 <445484...@qq.com> wrote: > Besides, I have tried the ANALYZE statement. It has no use because I need the > single partition but get the total table
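
Independent of statistics, an explicit hint sidesteps unreliable size estimates altogether; `factDf`, `smallDf`, and `key` are placeholders:

    import org.apache.spark.sql.functions.broadcast

    // Forces a broadcast hash join regardless of what the stats say:
    val joined = factDf.join(broadcast(smallDf), Seq("key"))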

[SPARK-24771] Upgrade AVRO version from 1.7.7 to 1.8

2018-08-14 Thread Wenchen Fan
Hi all, We've upgraded Avro from 1.7 to 1.8, to support date/timestamp/decimal types in the newly added Avro data source in the coming Spark 2.4, and also to make Avro work with Parquet. Since Avro 1.8 is not binary compatible with Avro 1.7 (see https://issues.apache.org/jira/browse/AVRO-1502),

Re: AccumulatorV2 vs AccumulableParam (V1)

2018-05-03 Thread Wenchen Fan
Hi Sergey, Thanks for your valuable feedback! For 1: yea this is definitely a bug and I have sent a PR to fix it. For 2: I have left my comments on the JIRA ticket. For 3: I don't quite understand it, can you give some concrete examples? For 4: yea this is a problem, but I think it's not a big

Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting, can you include the steps to reproduce this bug? On Tue, Jan 16, 2018 at 7:07 AM, Ted Yu wrote: > Did you include any picture ? > > Looks like the picture didn't go thru. > > Please use third party site. > > Thanks > > Original message

Re: How to persistent database/table created in sparkSession

2017-12-05 Thread Wenchen Fan
Try with `SparkSession.builder().enableHiveSupport()` ? On Tue, Dec 5, 2017 at 3:22 PM, 163 wrote: > Hi, > How can I persist a database/table created in a Spark application? > > object TestPersistentDB { > def main(args:Array[String]): Unit = { >
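
A minimal sketch of that suggestion: with Hive support enabled, databases and tables are recorded in the Hive metastore and survive the application, unlike temp views (assumes a reachable metastore):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("TestPersistentDB")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS testdb")
    spark.sql("CREATE TABLE IF NOT EXISTS testdb.t (id INT) USING parquet")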

Re: Dataset API Question

2017-10-25 Thread Wenchen Fan
It's because of different API design. *RDD.checkpoint* returns void, which means it mutates the RDD state so you need a *RDD**.isCheckpointed* method to check if this RDD is checkpointed. *Dataset.checkpoint* returns a new Dataset, which means there is no isCheckpointed state in Dataset, and
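
The two APIs side by side, assuming an active `spark` session; the checkpoint directory is hypothetical:

    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

    val rdd = spark.sparkContext.parallelize(1 to 10)
    rdd.checkpoint()               // returns Unit: marks this RDD for checkpointing...
    rdd.count()                    // ...a job then materializes the checkpoint...
    val done = rdd.isCheckpointed  // ...so a separate method queries the mutated state

    val ds = spark.range(10)
    val ds2 = ds.checkpoint()      // eager by default; returns a NEW Dataset, so no
                                   // isCheckpointed state is needed on ds itself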

Re: appendix

2017-06-20 Thread Wenchen Fan
You should make HBase a data source (it seems we already have an HBase connector?), create a DataFrame from HBase, and do the join in Spark SQL. > On 21 Jun 2017, at 10:17 AM, sunerhan1...@sina.com wrote: > > Hello, > My scenario is like this: > 1.val df=hivecontext/carboncontex.sql("sql") >
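
A sketch of the suggested shape; the connector format name and options vary by HBase connector and are hypothetical here, and `hive_t` stands for the Hive-side table:

    val hbaseDf = spark.read
      .format("org.apache.hadoop.hbase.spark")  // depends on the connector you pick
      .option("hbase.table", "my_table")
      .load()
    hbaseDf.createOrReplaceTempView("hbase_t")

    // The join then runs entirely in Spark SQL:
    val result = spark.sql(
      "SELECT h.rowkey, d.value FROM hbase_t h JOIN hive_t d ON h.rowkey = d.key")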