My thought is that Spark supports analytics for structured and unstructured data, batch as well as real-time. That was pretty revolutionary when Spark first came out, and I think that's where the "unified" term came from. Even after all these years, Spark remains a trusted framework for enterprise analytics.
Hi,
I think that it is just a marketing statement. But with Spark 3.x, now that you can see Spark is no more than just another distributed data processing engine, they are trying to fold data pre-processing directly into ML pipelines. I might call that unified.
But you get the same with several other frameworks.
Hi,
6 billion rows is quite small; I can do it on my laptop with around 4 GB of RAM. Which version of Spark are you using, and what is the effective memory you have per executor?
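(For reference, the per-executor memory is set at submit time; the values below are illustrative, not a recommendation:)

```
spark-submit \
  --num-executors 10 \
  --executor-memory 8g \
  --executor-cores 4 \
  --conf spark.memory.fraction=0.6 \
  my_job.py
```

The effective memory available for execution and storage is `executor-memory` scaled down by `spark.memory.fraction` (0.6 by default), minus reserved memory, so it is noticeably less than the raw `--executor-memory` figure.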
Regards,
Gourav Sengupta
On Mon, Oct 19, 2020 at 4:24 AM Lalwani, Jayesh
wrote:
> I have a Dataframe with around 6 billion rows, and about 20 columns.
I have a Dataframe with around 6 billion rows and about 20 columns. First of all, I want to write this dataframe out to Parquet. Then, out of the 20 columns, I have 3 columns of interest, and I want to find how many distinct values of those columns there are in the file. I don’t need the actual distinct values, just the counts.
If it was running fine before and has stopped working now, one thing I can think of is that your disk may be full. Checking your disk space and cleaning up your old log files might help...
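A quick sketch of that check (the Spark work and log paths below are common defaults; adjust to your installation):

```shell
# How full is the disk holding Spark's scratch space?
df -h /tmp

# Largest entries under a typical Spark work dir (path is a guess, adjust to yours).
du -sh /opt/spark/work/* 2>/dev/null | sort -h | tail -n 5

# List (not yet delete) logs older than 7 days -- review before removing anything.
find /var/log/spark -name '*.log' -mtime +7 2>/dev/null
```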
On 10/18/20 12:06 PM, rajat kumar wrote:
Hello Everyone,
My spark streaming job is running too slow: it has a batch time of 15 seconds, but the batch takes 20-22 secs to complete.
Apache Spark's mission statement is: "Apache Spark™ is a unified analytics engine for large-scale data processing."
What is the word "unified" referring to?
Hello Everyone,
My spark streaming job is running too slow: it has a batch time of 15 seconds, but the batch takes 20-22 secs to complete. It was fine till the 1st week of October, but it has suddenly started behaving this way. I know changing the batch time can help, but other than that, any idea what can be done?