Thanks for reporting it. Please open a JIRA with a test case.
Cheers,
Xiao
On Wed, May 27, 2020 at 1:42 PM Pasha Finkelshteyn <
pavel.finkelsht...@gmail.com> wrote:
> Hi folks,
>
> I'm implementing Kotlin bindings for Spark and faced a strange problem. In
> one corner case Spark works differently
Hi Randy,
Yes, I'm using parquet on both S3 and hdfs.
On Thu, 28 May, 2020, 2:38 am randy clinton wrote:
> Is the file Parquet on S3 or is it some other file format?
>
> In general I would assume that HDFS read/writes are more performant for
> spark jobs.
>
> For instance, consider how well
Yes, that's exactly how I am creating them.
Question: are you using Stateful Structured Streaming, in which you have
something like this?
.mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
  updateAcrossEvents
)
And updating the Accumulator inside 'updateAcrossEvents'?
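For reference, a minimal sketch of what that pattern might look like. The types and names here (Event, EventState, updateAcrossEvents taking the accumulator as a parameter, and the eventCount accumulator itself) are illustrative assumptions, not taken from the original application:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import org.apache.spark.util.LongAccumulator

// Hypothetical event and state types for illustration.
case class Event(id: String, ts: Long)
case class EventState(count: Long)

// Assumed accumulator registered on the driver:
//   val eventCount: LongAccumulator = spark.sparkContext.longAccumulator("eventCount")
def updateAcrossEvents(eventCount: LongAccumulator)(
    key: String,
    events: Iterator[Event],
    state: GroupState[EventState]): EventState = {
  val seen = events.size
  eventCount.add(seen) // accumulator updated inside the state function
  val updated = EventState(state.getOption.map(_.count).getOrElse(0L) + seen)
  state.update(updated)
  updated
}

// Wired into the stream roughly as in the snippet above:
// ds.groupByKey(_.id)
//   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(
//     updateAcrossEvents(eventCount)
//   )
```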
Is the file Parquet on S3 or is it some other file format?
In general I would assume that HDFS read/writes are more performant for
spark jobs.
For instance, consider how well partitioned your HDFS file is vs the S3
file.
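One rough way to compare the two sources is to time the same full scan against each path. The paths below are placeholders, and count() is used only to force the read to materialize:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-vs-hdfs").getOrCreate()

def timeScan(path: String): Long = {
  val start = System.nanoTime()
  // count() forces a full scan of the Parquet files.
  spark.read.parquet(path).count()
  (System.nanoTime() - start) / 1000000 // milliseconds
}

// Hypothetical locations of the same dataset.
val hdfsMs = timeScan("hdfs:///data/events.parquet")
val s3Ms   = timeScan("s3a://my-bucket/data/events.parquet")
println(s"HDFS: ${hdfsMs} ms, S3: ${s3Ms} ms")
```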
On Wed, May 27, 2020 at 1:51 PM Dark Crusader
wrote:
> Hi Jörn,
>
>
Hi folks,
I'm implementing Kotlin bindings for Spark and faced a strange problem. In
one corner case Spark works differently when whole-stage codegen is on or
off.
Does this look like a bug or expected behavior?
--
Regards,
Pasha
Big Data Tools @ JetBrains
Hi Jörn,
Thanks for the reply. I will try to create an easier example to reproduce
the issue.
I will also try your suggestion to look into the UI. Can you guide me on
what I should be looking for?
I was already using the s3a protocol to compare the times.
My hunch is that multiple reads from S3
Have you looked in the Spark UI to see why this is the case?
S3 reads can take more time; it also depends on which S3 URL scheme you are
using: s3a vs s3n vs s3.
It could help to persist in memory or on HDFS after some computation. You
can also load from S3 initially, store it on HDFS, and work from there.
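The suggestion above might look roughly like this. The paths are placeholders and the choice of StorageLevel is an assumption:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("s3-to-hdfs").getOrCreate()

// One-time copy: load from S3, store on HDFS, then work from HDFS.
spark.read.parquet("s3a://my-bucket/data/input.parquet")
  .write.mode("overwrite").parquet("hdfs:///staging/input.parquet")

val df = spark.read.parquet("hdfs:///staging/input.parquet")

// Alternatively (or additionally), cache intermediate results in memory,
// spilling to disk if they don't fit.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count() // materialize the cache before the ML job reuses df
```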
Hi all,
I am reading data from hdfs in the form of parquet files (around 3 GB) and
running an algorithm from the spark ml library.
If I create the same spark dataframe by reading data from S3, the same
algorithm takes considerably more time.
I don't understand why this is happening. Is this a
Yes, I am talking about application-specific Accumulators. Actually I am
getting the values printed in my driver log as well as sent to Grafana. Not
sure where and when I saw 0 before. My deploy mode is "client" on a YARN
cluster (not local Mac) where I submit from the master node. It should work the
No firm dates; it always depends on RC voting. Another RC is coming soon.
It is however looking pretty close to done.
On Wed, May 27, 2020 at 3:54 AM ARNAV NEGI SOFTWARE ARCHITECT <
negi.ar...@gmail.com> wrote:
> Hi,
>
> I am working on Spark 3.0 preview release for large Spark jobs on
>
I have no idea.
I compiled a Docker image that you can find on Docker Hub, and you can do
some experiments with it by composing a cluster.
https://hub.docker.com/r/gaetanofabiano/spark
Let me know if you have any news about the release.
Regards
Sent from my iPhone
> On 27 May 2020, at
Hi,
I am working on Spark 3.0 preview release for large Spark jobs on
Kubernetes and preview looks promising.
Could you tell me when Spark 3.0 GA is expected? Definitive dates would
help us plan our roadmap for Spark 3.0.
Arnav Negi / Technical Architect | Web Technology Enthusiast
Hi Team,
We are using spark on Kubernetes, through spark-on-k8s-operator. Our
application deals with multiple updateStateByKey operations. Upon
investigation, we found that the Spark application consumes a high volume of
memory. As spark-on-k8s-operator doesn't give the option to segregate
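For context, updateStateByKey keeps state for every key it has ever seen, which is one common reason streaming memory grows over time; returning None from the update function drops a key. A minimal sketch, where the Counter type and the one-hour expiry are illustrative assumptions:

```scala
// Hypothetical per-key state for illustration.
case class Counter(total: Long, lastSeen: Long)

def updateFunc(values: Seq[Long], state: Option[Counter]): Option[Counter] = {
  val now = System.currentTimeMillis()
  val prev = state.getOrElse(Counter(0L, now))
  if (values.isEmpty && now - prev.lastSeen > 3600000L) {
    None // expire idle keys so state (and memory) stops growing
  } else {
    Some(Counter(prev.total + values.sum,
                 if (values.isEmpty) prev.lastSeen else now))
  }
}

// Usage, assuming a DStream[(String, Long)] named events and
// checkpointing enabled on the StreamingContext:
// ssc.checkpoint("hdfs:///checkpoints/app")
// events.updateStateByKey(updateFunc _)
```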
Hi Kun,
You can use the following Spark property while launching the app, instead
of manually enabling it in the code:
spark.sql.catalogImplementation=hive
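For example, the property can be passed at submit time rather than set in code (the class and jar names below are placeholders):

```
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --class com.example.MyApp \
  my-app.jar
```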
Kind Regards
Harsh
On Tue, May 26, 2020 at 9:55 PM Kun Huang (COSMOS)
wrote:
>
> Hi Spark experts,
>
> I am seeking for an approach