We are using Spark on Kubernetes on AWS (it's a long story), but it does
work. It's still on the raw side, but we've been pretty successful.
We configured our cluster primarily with Kube-AWS and auto scaling groups.
There are gotchas there, but so far it has worked well for us.
Gary Lucas
Thanks, everyone, for your suggestions. Do any of you handle automatic
scale-up and scale-down of your underlying Spark clusters on AWS?
On Nov 14, 2017 10:46 AM, "lucas.g...@gmail.com" wrote:
Hi Ashish, bear in mind that EMR has some additional tooling available that smoothes
Dear All,
I was training a RandomForest on an input dataset with 20,000 columns
and 12,000 rows.
But when I start training, it throws an exception:
Constant pool for class
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificColumnarIterator
has grown past JVM limit of
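
A workaround often suggested for this constant-pool error (not from this
thread) is to assemble the wide column set into a single vector column, so
the generated iterator code stays small. A minimal Scala sketch, assuming
a DataFrame df with 20,000 numeric feature columns plus a "label" column
(classifier assumed; the regressor works the same way):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical: every column except "label" is a numeric feature.
val featureCols = df.columns.filter(_ != "label")

// Collapsing the wide schema into one vector column keeps the
// generated code (and its constant pool) small.
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(Array(assembler, rf)).fit(df)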
Hi - I had originally posted this as a bug (SPARK-22528) but given my
uncertainty, it was suggested that I send it to the mailing list instead...
We are using Azure Data Lake (ADL) to store our event logs. This worked
fine in 2.1.x, but in 2.2.0 the underlying files are no longer visible to
the
Hi,
I am running a Spark Streaming job and it is not picking up the next
batches, but the job still shows as running on YARN.
Is this expected behavior when there is no data, or is it waiting for data
to pick up?
I am almost 4 hours behind on batches (30-minute interval).
Is a union of 2 Structured Streaming dataframes from different sources
supported in 2.2?
We have done a union of 2 streaming dataframes that are from the same source.
I wanted to know if multiple streams are supported, or going to be supported
in the future.
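
For reference, a minimal Scala sketch of unioning two streaming DataFrames
from different sources (the paths and schema here are hypothetical; both
sides of a union must share the same schema):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("union-streams").getOrCreate()

// Hypothetical schema shared by both sources.
val eventSchema = new StructType()
  .add("id", StringType)
  .add("ts", TimestampType)

// Two independent file sources with matching schemas.
val streamA = spark.readStream.schema(eventSchema).json("/data/sourceA")
val streamB = spark.readStream.schema(eventSchema).json("/data/sourceB")

streamA.union(streamB)
  .writeStream
  .format("console")
  .outputMode("append")
  .start()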
Hi,
Use the explode function, the filter operator, and the collect_list function.
Or a "heavier" flatMap.
Regards,
Jacek Laskowski
https://about.me/JacekLaskowski
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Apache Spark 2 https://bit.ly/mastering-apache-spark
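
A minimal Scala sketch of that explode/filter approach, against a
hypothetical tags column of key/value structs like the one in the schema
question further down (df and id are assumed names):

import org.apache.spark.sql.functions._

// Hypothetical input: df has an "id" column and a
// tags: array<struct<key: string, value: string>> column.
val maxSpeed = df
  .select(col("id"), explode(col("tags")).as("tag"))  // one row per tag
  .filter(col("tag.key") === "MaxSpeed")              // keep only MaxSpeed entries
  .select(col("id"), col("tag.value").as("maxSpeed"))

// collect_list can regroup the remaining tags per id after filtering:
val regrouped = df
  .select(col("id"), explode(col("tags")).as("tag"))
  .filter(col("tag.key") =!= "MaxSpeed")
  .groupBy("id")
  .agg(collect_list(col("tag")).as("tags"))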
I don't see anything obvious; you'd need to do more troubleshooting.
You could also try creating a single RDD for a known range of offsets:
http://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html#creating-an-rdd
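
A minimal Scala sketch of that approach with the 0-10 integration (the
broker, topic, and offsets here are hypothetical; sc is an existing
SparkContext, e.g. from the spark-shell):

import scala.collection.JavaConverters._
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{KafkaUtils, LocationStrategies, OffsetRange}

// Hypothetical connection settings; substitute your own cluster.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "offset-debug"
).asJava

// Partition 0 of topic "events", offsets 0 (inclusive) to 100 (exclusive).
val offsetRanges = Array(OffsetRange("events", 0, 0L, 100L))

val rdd = KafkaUtils.createRDD[String, String](
  sc, kafkaParams, offsetRanges, LocationStrategies.PreferConsistent)

// Verify the fetched records look sane.
rdd.map(_.value).take(5).foreach(println)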
On Wed, Nov 15, 2017 at 9:33 PM, jagadish kagitala wrote:
Hi,
I have the following schema in a dataframe, and I want to extract the key
that matches "MaxSpeed" from the array, along with its corresponding value.
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |--
Note that I have 1+ TB. If I didn't mind things being slow, I
wouldn't be using Spark.
On 17 November 2017 at 11:06, Sebastian Piu wrote:
> If you don't want to recalculate you need to hold the results somewhere,
> if you need to save it why don't you do that
If you don't want to recalculate, you need to hold the results somewhere.
If you need to save it anyway, why don't you do that, and then read it again
and get your stats?
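
A minimal Scala sketch of both options (raw, spark, and the paths are
hypothetical names): DISK_ONLY persistence avoids recomputation without
holding 1+ TB in memory, and writing out once then re-reading achieves the
same across independent branches:

import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel

// Hypothetical: "expensive" is the transformation we don't want to recompute.
val expensive = raw.groupBy("key").count()

// Option 1: spill to disk instead of memory, then branch.
expensive.persist(StorageLevel.DISK_ONLY)
val branchA = expensive.filter(col("count") > 10).count()
val branchB = expensive.agg(sum("count")).first()
expensive.unpersist()

// Option 2: save once, read back for each branch.
expensive.write.mode("overwrite").parquet("/tmp/expensive")
val reloaded = spark.read.parquet("/tmp/expensive")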
On Fri, 17 Nov 2017, 10:03 Fernando Pereira, wrote:
> Dear Spark users
>
> Is it possible to take the output of
Dear Spark users
Is it possible to take the output of a transformation (RDD/Dataframe) and
feed it to two independent transformations without recalculating the first
transformation and without caching the whole dataset?
Consider the case of a very large dataset (1+TB) which suffered several