Hi,
I have a REST GET endpoint that serves data packets from a queue. I am
thinking of implementing a Spark custom receiver to stream the data into
Spark. Could somebody please help me with how to integrate the REST
endpoint with a Spark custom receiver class?
Regards,
Favas
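The Receiver API itself is JVM-only (Scala/Java), so from PySpark one
common workaround is a small feeder loop that polls the endpoint and lands
each batch as a file for a file-source streaming query to pick up. A
minimal sketch, not the Receiver API itself; the endpoint URL, record
schema, and landing directory are all assumptions:

import json
import os
import time
import uuid

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rest-feeder").getOrCreate()

LANDING_DIR = "/tmp/rest-landing"          # hypothetical, watched below
ENDPOINT = "http://example.com/api/queue"  # hypothetical REST endpoint
os.makedirs(LANDING_DIR, exist_ok=True)

# Streaming query that consumes whatever the feeder loop writes.
stream = (spark.readStream
          .schema("id STRING, payload STRING")   # assumed record schema
          .json(LANDING_DIR)
          .writeStream
          .format("console")
          .start())

# Feeder loop: one GET per poll, one JSON-lines file per non-empty batch.
while stream.isActive:
    records = requests.get(ENDPOINT, timeout=10).json()  # assumed: JSON array
    if records:
        with open(f"{LANDING_DIR}/batch-{uuid.uuid4()}.json", "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")
    time.sleep(5)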
After adding the sequential ids you might need a repartition? I've found
when using monotonically_increasing_id before that the df ends up in a
single partition. It usually becomes clear in the Spark UI, though.
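A minimal sketch of that check and fix, assuming a DataFrame df already
exists and with an arbitrary target partition count:

from pyspark.sql import functions as F

df_with_id = df.withColumn("id", F.monotonically_increasing_id())
print(df_with_id.rdd.getNumPartitions())  # confirm here or in the Spark UI
df_with_id = df_with_id.repartition(200)  # 200 is an arbitrary example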
On Tue, 6 Oct 2020, 20:38 Sachit Murarka wrote:
Yes, even I tried the same first. Then I moved to the join method because
shuffle spill was happening: row_number() without a partition clause runs
on a single task. Instead of processing the entire dataframe on a single
task, I have broken it down into df1 and df2 and I am joining them,
because df2 has very little data.
Try to avoid broadcast. I thought this:
https://towardsdatascience.com/adding-sequential-ids-to-a-spark-dataframe-fa0df5566ff6
could be helpful.
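One approach along the lines the article discusses is RDD zipWithIndex,
which assigns truly sequential ids without a global window; a minimal
sketch with toy data:

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

# zipWithIndex pairs each row with a 0-based sequential index.
with_id = (df.rdd
           .zipWithIndex()
           .map(lambda pair: Row(**pair[0].asDict(), id=pair[1]))
           .toDF())
with_id.show()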
On Tue, Oct 6, 2020 at 12:18 PM Sachit Murarka wrote:
Thanks Eve for the response.
Yes, I know we can use broadcast for smaller datasets. I increased the
threshold (4 GB) for the same, but it still did not work, and df3 is
somewhat larger than 2 GB.
I am also trying with broadcast removed; the job has been running for an
hour. Will let you know.
Thanks
Sachit
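For reference, a sketch of the knobs being discussed; the 4 GB value just
mirrors the thread:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Raise the auto-broadcast threshold (in bytes); -1 disables it entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 4 * 1024**3)

# Or hint a single join explicitly and leave the global threshold alone:
# joined = df1.join(broadcast(df2), "id")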
How many rows does df3 have? Broadcast joins are a great way to append
data stored in relatively *small* single-source-of-truth data files to
large DataFrames. DataFrames up to 2 GB can be broadcast, so a data file
with tens or even hundreds of thousands of rows is a broadcast candidate.
Your
Hello Users,
I am facing an issue in a Spark job where I am doing row_number() without
a partition by clause, because I need to add sequentially increasing IDs.
But to avoid the large spill I am not doing row_number() over the complete
data frame.
Instead I am applying monotonically_increasing_id on
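One way to get sequential ids without funnelling everything through a
single window task is a two-step offset join, in the same spirit as the
df1/df2 split described above; a sketch, with stand-in data:

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000).withColumnRenamed("id", "value")  # stand-in data

# 1. Number rows within each Spark partition (stays fully parallel).
df = df.withColumn("part_id", F.spark_partition_id())
df = df.withColumn("mono_id", F.monotonically_increasing_id())
w = Window.partitionBy("part_id").orderBy("mono_id")
df = df.withColumn("local_rn", F.row_number().over(w))

# 2. Tiny per-partition offsets (the small "df2" of the join approach).
counts = df.groupBy("part_id").count()
w2 = Window.orderBy("part_id").rowsBetween(Window.unboundedPreceding, -1)
offsets = counts.withColumn(
    "offset", F.coalesce(F.sum("count").over(w2), F.lit(0)))

# 3. Join the small offsets back and combine into a sequential id.
df = (df.join(offsets.select("part_id", "offset"), "part_id")
        .withColumn("seq_id", F.col("offset") + F.col("local_rn"))
        .drop("part_id", "mono_id", "local_rn", "offset"))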
My 2 cents is that this is a complicated question, since I'm not confident
that Spark is 100% compatible with Hive in terms of query language. I have
an unanswered question on this list about it:
Hi All,
Not sure if I need to ask this question on the Spark community or the Hive
community. We have a set of Hive scripts that run on EMR (Tez engine). We
would like to experiment by moving some of them onto Spark. We are
planning to experiment with two options.
1. Use the current code based on
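One common starting point for such an experiment is running the existing
HiveQL through a Hive-enabled SparkSession and diffing the results; a
sketch, with a hypothetical script name:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-on-spark-experiment")
         .enableHiveSupport()        # reuse the existing Hive metastore
         .getOrCreate())

with open("etl_step1.hql") as f:     # hypothetical Hive script
    for stmt in f.read().split(";"):
        if stmt.strip():
            spark.sql(stmt)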
Note: forwarding to the list; I incorrectly hit "Reply" first instead of
"Reply List".
Hello,
Does your code run without enabling fallback mode? Arrow vectorization
might not actually be applied - check whether you still observe
"javaToPython" stages in the Spark UI. Also check that the data is not
skewed (partitions are too
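To make that check concrete: with the fallback disabled, anything Arrow
cannot vectorize fails loudly instead of silently taking the slow path. A
minimal sketch:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.sql.execution.arrow.pyspark.enabled", "true")
         .config("spark.sql.execution.arrow.pyspark.fallback.enabled",
                 "false")
         .getOrCreate())

pdf = spark.range(10_000).toPandas()  # raises if Arrow can't be applied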
Hi,
I used these settings but did not see an obvious improvement (190 minutes
reduced to 170 minutes):
spark.sql.execution.arrow.pyspark.enabled: True
spark.sql.execution.arrow.pyspark.fallback.enabled: True
This job heavily uses pandas UDFs and it runs on an EMR cluster of 30
xlarge nodes.
Any ideas?
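For context, a minimal scalar pandas UDF of the kind such a job would use;
Arrow only pays off when the per-batch work is vectorized like this rather
than done row by row in Python (names here are illustrative):

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

@pandas_udf("double")
def plus_tax(price: pd.Series) -> pd.Series:
    return price * 1.08            # vectorized over each Arrow batch

df = spark.createDataFrame([(10.0,), (20.0,)], ["price"])
df.select(plus_tax("price").alias("total")).show()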
Hello,
I have to write the aggregated output stored in a DF (about 3 GB) to a
single file; for that I have tried using repartition(1) as well as
coalesce(1).
My job is failing with the following exception:
ExecutorLostFailure (executor 32 exited caused by one of the running tasks)
Reason: Remote
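Note that both calls funnel the full 3 GB through one task, so that single
executor needs the memory headroom; an alternative is to write in parallel
and merge afterwards. A sketch, with hypothetical output paths:

# Single-task write: the one task must handle the whole output.
df.coalesce(1).write.mode("overwrite").csv("/out/single")

# Parallel write, then merge outside Spark, e.g.:
#   hadoop fs -getmerge /out/parts /local/merged.csv
df.write.mode("overwrite").csv("/out/parts")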
Hi Siva,
In that case you can use the Structured Streaming foreach / foreachBatch
functions, which can help you process each record and write it to some
sink.
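A minimal foreachBatch sketch: each micro-batch arrives as an ordinary
DataFrame that can be written with any batch writer (the source and sink
path below are stand-ins):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def write_batch(batch_df, batch_id):
    # Any batch sink works here; the path is hypothetical.
    batch_df.write.mode("append").parquet("/sink/path")

query = (spark.readStream
         .format("rate")           # stand-in streaming source
         .load()
         .writeStream
         .foreachBatch(write_batch)
         .start())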
Hello,
I am working as a Data Scientist and using Spark with Python while
developing my models.
Our data model has multiple nested structures (e.g.
attributes.product.feature.name). For the first two levels, I can read
them as follows with PySpark:
data.where(col('attributes.product') !=
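For the deeper levels, the same dotted col() path usually keeps working,
unless an intermediate level is an array, in which case it needs an
explode first; a sketch using the field names from the message:

from pyspark.sql import functions as F

# Plain struct nesting: the dotted path works at any depth.
data.select(F.col("attributes.product.feature.name"))

# If 'feature' is an array of structs, explode it first:
flat = data.withColumn("feature", F.explode("attributes.product.feature"))
flat.select(F.col("feature.name"))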
Hi Jungtaek,
Thank you very much for the clarification.
> On 5 Oct 2020, at 15:17, Jungtaek Lim wrote:
>
> Hi,
>
> That's not explained in the SS guide doc, but it is explained in the Scala API doc:
> http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/streaming/GroupState.html