Seeking help: pulling data from Kafka into Hive with Flink 1.11.3

2021-02-03 Thread yinghua...@163.com
I could not find a sample on the official site for going from Kafka to Hive in Flink SQL. I dug through the test code, but it requires special handling in your own code. Could someone share a sample that uses the Hive connector directly in Flink SQL? yinghua...@163.com

[Spark on Kubernetes] Spark Application dependency management Question.

2021-02-03 Thread xgong
Hey Team: We are currently upgrading our Spark version from 2.4 to 3.0, but we found that applications which work in Spark 2.4 keep failing with Spark 3.0. We are running Spark on Kubernetes in cluster mode. In spark-submit, we have "--jars local:///apps-dep/spark-extra-jars/*". It
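The preview cuts off before the actual failure, but one workaround sometimes suggested for glob-style `--jars` arguments is to expand the jar directory yourself and pass spark-submit an explicit comma-separated list instead of relying on `local:///.../ *` expansion. This is a hypothetical sketch, not a confirmed fix; the directory layout, the `local://` scheme, and the helper name are assumptions taken from the message above.

```python
# Hypothetical helper: build an explicit --jars value from every .jar in a
# directory, rather than relying on glob expansion of local:///...//*.
# The path and scheme are assumptions based on the message above.
import glob
import os

def jars_arg(jar_dir: str, scheme: str = "local://") -> str:
    """Return a comma-separated --jars value listing every .jar in jar_dir."""
    jars = sorted(glob.glob(os.path.join(jar_dir, "*.jar")))
    return ",".join(scheme + os.path.abspath(j) for j in jars)

# Example (hypothetical path):
#   spark-submit --jars "$(python -c 'from expand import jars_arg; print(jars_arg("/apps-dep/spark-extra-jars"))')" ...
```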

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Silvio Fiorito
As I suggested, you need to use repartition(1) in place of coalesce(1). That will give you a single-file output without losing parallelization for the rest of the job. From: James Yu Date: Wednesday, February 3, 2021 at 2:19 PM To: Silvio Fiorito , user Subject: Re: Poor performance caused by

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Gourav Sengupta
Hi, as always, I would like to first identify the problem before solving it. To isolate the problem, first write the data out to a storage location without coalesce and check the time. Then do coalesce to one and check the time. If the time between writing down between

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Mich Talebzadeh
That sounds like a plan. As suggested by Sean, I have also seen caching the result set before coalesce provide benefits, especially for a minute 50 MB dataset. Check the Spark UI Storage tab for its effect. HTH Mich LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
Hi Silvio, The result file is less than 50 MB in size so I think it is small and acceptable enough for one task to write. Your suggestion sounds interesting. Could you guide us further on how to easily "add a stage boundary"? Thanks From: Silvio Fiorito Sent:

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Sean Owen
It could probably also be because coalesce can cause some upstream transformations to also have a parallelism of 1. I think (?) an OK solution is to cache the result, then coalesce and write. Or combine the files after the fact, or do what Silvio said. On Wed, Feb 3, 2021 at 12:55 PM James Yu

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Stéphane Verlet
I had that issue too and, from what I gathered, it is an expected optimization... Try using repartition instead. On Feb 3, 2021, at 11:55, James Yu wrote: >Hi Team, > >We are running into this poor performance issue and seeking your >suggestion on how to improve

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Silvio Fiorito
Coalesce reduces the parallelization of your last stage, in your case to 1 task, so it's natural that it gives poor performance, especially with large data. If you absolutely need a single file output, you can instead add a stage boundary and use repartition(1). This will give your query

Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
Hi Team, We are running into this poor performance issue and seeking your suggestion on how to improve it: We have a particular dataset which we aggregate from other datasets and like to write out to one single file (because it is small enough). We found that after a series of

Re: Assertion of return value of dataframe in pytest

2021-02-03 Thread Mich Talebzadeh
Thanks Marco. This is one approach: # Start as we defined the dataframe to write to Oracle df2 = house_df. \ select( \ F.date_format('datetaken', '').cast("Integer").alias('YEAR') \ , 'REGIONNAME' \ ,

Re: Assertion of return value of dataframe in pytest

2021-02-03 Thread Sofia’s World
Hello, my 2 cents: that would be an integration test writing to a 'dev' database (which you might pre-populate and clean up after your runs, so you can have repeatable data). Then either you 1 - use normal SQL and assert that the values you store in your dataframe are the same as what you get

Re: Assertion of return value of dataframe in pytest

2021-02-03 Thread Mich Talebzadeh
It appears that the following assertion works, assuming that the result set can be = 0 (no data) or > 0 (there is data): assert df2.count() >= 0. However, if I wanted to write to a JDBC database from PySpark through a function (already defined in another module) as below def

Assertion of return value of dataframe in pytest

2021-02-03 Thread Mich Talebzadeh
Hi, in pytest you want to ensure that the composed DataFrame has the correct return. Example: df2 = house_df. \ select( \ F.date_format('datetaken', '').cast("Integer").alias('YEAR') \ , 'REGIONNAME' \ ,

Re: Spark SQL query

2021-02-03 Thread Mich Talebzadeh
One thing you can do is open another thread for this feature request, "Having functionality in Spark to allow queries to be gathered and analyzed", and see how the forum responds to it. HTH

Re: Spark SQL query

2021-02-03 Thread Arpan Bhandari
Yes Mich, mapping the Spark SQL query that was executed to its corresponding application ID on YARN would greatly help in analyzing and debugging the query for any potential problems. Thanks, Arpan Bhandari -- Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

Re: Spark SQL query

2021-02-03 Thread Mich Talebzadeh
I gather what you are after is a code sniffer for Spark that provides a form of GUI to see the code that applications run against Spark. I don't think Spark has this type of plug-in, although it would be potentially useful. Some RDBMSs provide this, usually stored on some form of persistent storage

Re: S3a Committer

2021-02-03 Thread Gourav Sengupta
Why s3a? Regards, Gourav Sengupta On Wed, Feb 3, 2021 at 7:35 AM YoungKun Min wrote: > Hi, > > I have almost the same problem with Ceph RGW, and currently do research > about Apache Iceberg and Databricks Delta(opensource version). > I think these libraries can address the problem. > > > 2021년