I don't see an example on the official site of going from Kafka to Hive in
Flink SQL. I dug through the test code, but it needs special handling in your
own code. Could someone share an example of using the Hive Connector directly
from Flink SQL?
yinghua...@163.com
Hey Team:
We are currently upgrading our Spark version from 2.4 to 3.0, but we have
found that applications which work on Spark 2.4 keep failing on Spark 3.0.
We are running Spark on Kubernetes in cluster mode.
In spark-submit, we have "--jars local:///apps-dep/spark-extra-jars/*". It
As I suggested, you need to use repartition(1) in place of coalesce(1).
That will give you a single file output without losing parallelization for the
rest of the job.
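For reference, a minimal PySpark sketch of that substitution (the dataframe
name and output path are placeholders):

# coalesce(1) folds into the previous stage and can drag the whole
# final stage down to one task:
# df.coalesce(1).write.mode("overwrite").parquet("s3a://bucket/out")

# repartition(1) adds a shuffle, so upstream work keeps its parallelism
# and only the final write runs as a single task:
df.repartition(1).write.mode("overwrite").parquet("s3a://bucket/out")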
From: James Yu
Date: Wednesday, February 3, 2021 at 2:19 PM
To: Silvio Fiorito , user
Subject: Re: Poor performance caused by
Hi,
as always, I would like to first identify the problem before solving it.
So to isolate the problem, first without coalesce try to write the data out
to a storage location and check the time.
Then try to do coalesce to one and check the time.
If the time between writing down between
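A rough sketch of that isolation test (paths are placeholders; each write
recomputes the lineage, which is fine for a timing check):

import time

t0 = time.time()
df.write.mode("overwrite").parquet("/tmp/no_coalesce")        # full parallelism
t1 = time.time()
df.coalesce(1).write.mode("overwrite").parquet("/tmp/coalesce_1")
t2 = time.time()
print(f"plain write: {t1 - t0:.1f}s  coalesce(1) write: {t2 - t1:.1f}s")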
That sounds like a plan. As suggested by Sean, I have also seen caching the
result set before coalesce provide benefits, especially for a tiny 50MB dataset.
Check Spark GUI storage tab for its effect.
HTH
Mich
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
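A minimal sketch of the caching-before-coalesce idea Mich describes (storage
level and path are assumptions):

from pyspark import StorageLevel

rs = df.persist(StorageLevel.MEMORY_AND_DISK)   # cache the result set
rs.count()                                      # materialise it with full parallelism
rs.coalesce(1).write.mode("overwrite").parquet("/tmp/single_file")
rs.unpersist()                                  # effect is visible in the UI Storage tab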
Hi Silvio,
The result file is less than 50 MB in size so I think it is small and
acceptable enough for one task to write.
Your suggestion sounds interesting. Could you guide us further on how to easily
"add a stage boundary"?
Thanks
From: Silvio Fiorito
Sent:
It could also be because coalesce can cause some upstream transformations to
run with parallelism of 1. I think (?) an OK solution is to cache the result,
then coalesce and write. Or combine the files after the fact, or do what
Silvio said.
On Wed, Feb 3, 2021 at 12:55 PM James Yu
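On "combine the files after the fact": one hedged way is to write in parallel
first, then run a second small job that rewrites the output as a single file
(paths hypothetical):

df.write.mode("overwrite").parquet("/tmp/out_parts")          # fast, parallel write
(spark.read.parquet("/tmp/out_parts")
      .repartition(1)
      .write.mode("overwrite").parquet("/tmp/out_single"))    # cheap single-file rewrite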
I had that issue too, and from what I gathered it is an expected
optimization... Try using repartition instead.
On Feb 3, 2021, at 11:55, James Yu wrote:
>Hi Team,
>
>We are running into this poor performance issue and seeking your
>suggestion on how to improve
Coalesce is reducing the parallelization of your last stage, in your case to 1
task. So, it’s natural it will give poor performance especially with large
data. If you absolutely need a single file output, you can instead add a stage
boundary and use repartition(1). This will give your query
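The stage boundary is visible in the physical plan; a quick sketch comparing
the two:

df.coalesce(1).explain()     # plan shows Coalesce 1 folded into the final stage
df.repartition(1).explain()  # plan shows Exchange SinglePartition, i.e. a shuffle boundary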
Hi Team,
We are running into this poor performance issue and seeking your suggestion on
how to improve it:
We have a particular dataset which we aggregate from other datasets and would
like to write out to one single file (because it is small enough). We found
that after a series of
Thanks Marco.
This is an approach
# Start as we defined the dataframe to write to Oracle
df2 = house_df. \
    select( \
        F.date_format('datetaken', 'yyyy').cast("Integer").alias('YEAR') \
        , 'REGIONNAME' \
        ,
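The snippet above is truncated in the digest; for context, a hedged sketch of
how a DataFrame like df2 is typically written to Oracle over JDBC (URL, table
name, and credentials are placeholders):

df2.write \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/service") \
    .option("dbtable", "scratchpad.uk_house_prices") \
    .option("user", "scratchpad") \
    .option("password", "xxxx") \
    .option("driver", "oracle.jdbc.OracleDriver") \
    .mode("overwrite") \
    .save()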
Hello
My 2 cents:
Well, that would be an integration test writing to a 'dev' database (which
you might pre-populate and clean up after your runs, so you can have
repeatable data).
Then either you:
1 - use normal SQL and assert that the values you store in your dataframe
are the same as what you get
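A hedged sketch of option 1, a round trip against the dev database
(write_to_oracle, the DEV_DB_* settings, and the table name are hypothetical):

def test_write_round_trip(spark):
    expected = df2.orderBy('YEAR').collect()
    write_to_oracle(df2)                       # the function under test
    actual = (spark.read.format("jdbc")
                   .option("url", DEV_DB_URL)
                   .option("dbtable", "scratchpad.uk_house_prices")
                   .option("user", DEV_DB_USER)
                   .option("password", DEV_DB_PASSWORD)
                   .load()
                   .orderBy('YEAR')
                   .collect())
    assert actual == expected                  # Rows compare by value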
It appears that the following assertion works, assuming the result set can
be = 0 (no data) or > 0 (there is data):
assert df2.count() >= 0
However, if I wanted to write to a JDBC database from PySpark through a
function (already defined in another module) as below
def
Hi,
In pytest you want to ensure that the composed DataFrame returns the correct result.
Example
df2 = house_df. \
    select( \
        F.date_format('datetaken', 'yyyy').cast("Integer").alias('YEAR') \
        , 'REGIONNAME' \
        ,
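Continuing that thought, a small pytest-style check on the composed DataFrame
(assertions are illustrative; note that count() >= 0 is always true, so it is
worth asserting something stronger):

def test_house_df_projection():
    # YEAR should come out as an integer column after the cast
    assert dict(df2.dtypes)['YEAR'] == 'int'
    # there should actually be rows, not merely "zero or more"
    assert df2.count() > 0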
One thing I suggest you can do is open another thread for this feature
request:
"Having functionality in Spark to allow queries to be gathered and analyzed"
and see how the forum responds to it.
HTH
Yes Mich,
Mapping the Spark SQL query that got executed to its application id on YARN
would greatly help in analyzing and debugging the query for any potential
problems.
Thanks,
Arpan Bhandari
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
I gather what you are after is a code sniffer for Spark that provides a
form of GUI to get at the code that applications run against Spark.
I don't think Spark has this type of plug-in, although it would be
potentially useful. Some RDBMSs provide this, usually stored on some form of
persistent storage
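That said, on the narrower ask of mapping an application id to the SQL it
executed: recent Spark 3.x versions expose SQL executions through the
monitoring REST API; a hedged sketch against a History Server (host, port,
and app id are placeholders):

import requests

app_id = "application_1612345678901_0042"        # hypothetical YARN app id
base = "http://history-server:18080/api/v1"
for ex in requests.get(f"{base}/applications/{app_id}/sql").json():
    print(ex["id"], ex["description"])           # one entry per SQL execution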
Why s3a?
Regards,
Gourav Sengupta
On Wed, Feb 3, 2021 at 7:35 AM YoungKun Min wrote:
> Hi,
>
> I have almost the same problem with Ceph RGW, and am currently researching
> Apache Iceberg and Databricks Delta (open-source version).
> I think these libraries can address the problem.
>
>
> 2021년
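For what it's worth, s3a is the usual way to point Spark at an S3-compatible
store like Ceph RGW; a hedged config sketch (endpoint and keys are
placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ceph-rgw-s3a")
         .config("spark.hadoop.fs.s3a.endpoint", "http://rgw.example.com:7480")
         .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

df = spark.read.parquet("s3a://bucket/table/")   # reads straight from RGW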