Hi Don,
Good to hear from you. I think the problem is that, regardless of whether
you use yield or a generator, Spark will internally materialize the entire
result as a single large JVM object, which will blow up your heap space.
Would it be possible to shrink the overall size of the image object?
We have a largish Kinesis stream with about 25k events per second, and each
record is around 142k. I have tried multiple cluster sizes, multiple batch
sizes, multiple parameters... I am doing minimal transformations on the data.
Whatever happens, I can sustain consuming 25k with minimal
Also, the RDD StatCounter will already compute most of your desired metrics,
as will df.describe:
https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
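For example, a minimal sketch of both routes (the column name "value" is just
illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("stats-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("value")

    // DataFrame route: count, mean, stddev, min, max in one call.
    df.describe("value").show()

    // RDD route: StatCounter computes count, mean, stdev, variance, min, max.
    val stats = df.as[Double].rdd.stats()
    println(s"mean=${stats.mean}, stdev=${stats.stdev}")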
Georg Heiler wrote on Thu, Dec 14, 2017 at 19:40:
> Look at
Look at custom UDAF functions.
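For what it's worth, here is a minimal sketch of a typed Aggregator (one way
to write a custom aggregation; the names are illustrative):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    // Hypothetical example: mean of a Dataset[Double] as a custom aggregation.
    object MeanAgg extends Aggregator[Double, (Double, Long), Double] {
      def zero: (Double, Long) = (0.0, 0L)
      def reduce(b: (Double, Long), a: Double): (Double, Long) =
        (b._1 + a, b._2 + 1)
      def merge(x: (Double, Long), y: (Double, Long)): (Double, Long) =
        (x._1 + y._1, x._2 + y._2)
      def finish(r: (Double, Long)): Double = r._1 / r._2
      def bufferEncoder: Encoder[(Double, Long)] =
        Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
      def outputEncoder: Encoder[Double] = Encoders.scalaDouble
    }

    // Usage: ds.select(MeanAgg.toColumn) on a Dataset[Double].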
wrote on Thu, Dec 14, 2017 at 09:31:
> Hi dear Spark community!
>
> I want to create a lib which generates features for potentially very
> large datasets, so I believe Spark could be a nice tool for that.
> Let me explain what I need to do
This sounds like something mapPartitions should be able to do; I'm not
sure if there's an easier way.
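Something like this sketch (the record types and the expand function are
hypothetical):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class In(id: String)                                      // hypothetical input
    case class BigDataStructure(id: String, payload: Array[Byte])  // hypothetical output

    // Expansion as an Iterator, so nothing forces the whole result into memory.
    def expand(rec: In): Iterator[BigDataStructure] =
      Iterator.tabulate(3)(i => BigDataStructure(s"${rec.id}-$i", Array.emptyByteArray))

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val ds: Dataset[In] = Seq(In("a"), In("b")).toDS()

    // mapPartitions hands you an Iterator; flatMap on it stays lazy, so elements
    // are produced one at a time instead of being buffered per input record.
    val out: Dataset[BigDataStructure] = ds.mapPartitions(_.flatMap(expand))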
On Thu, Dec 14, 2017 at 10:20 AM, Don Drake wrote:
> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case
I'm looking for some advice when I have a flatMap on a Dataset that is
creating and returning a sequence of a new case class
(Seq[BigDataStructure]) that contains a very large amount of data, much
larger than the single input record (think images).
In Python, you can use generators (yield) to
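For reference, the shape I'm describing is roughly this (the names are
hypothetical):

    import org.apache.spark.sql.{Dataset, SparkSession}

    case class Img(path: String)                     // hypothetical input record
    case class BigDataStructure(bytes: Array[Byte])  // hypothetical large output

    // Expands one small input record into many large output records.
    def makeStructures(in: Img): Seq[BigDataStructure] = ???

    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._

    val ds: Dataset[Img] = Seq(Img("a.png")).toDS()

    // Each call materializes the whole Seq in memory before Spark consumes it,
    // which hurts when the expanded records are much larger than the input.
    val big: Dataset[BigDataStructure] = ds.flatMap(makeStructures _)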
Thank you for your response. In case of an update, we sometimes need to just
update a record, and in other cases we need to update the existing record and
insert a new record. The statement you proposed doesn't handle that.
Modern versions of Postgres have upsert, i.e. INSERT INTO ... ON
CONFLICT ... DO UPDATE.
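For example, against a hypothetical events(id, payload) table, the statement
would look like:

    // Postgres 9.5+ upsert; table and column names are illustrative.
    val upsertSql =
      """INSERT INTO events (id, payload)
        |VALUES (?, ?)
        |ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload""".stripMargin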
On Thu, Dec 14, 2017 at 11:26 AM, salemi wrote:
> Thank you for your response.
> The approach just loads the data into the DB. I am looking for an approach
> that allows me to update
Thank you for your response.
The approach just loads the data into the DB. I am looking for an approach
that allows me to update existing entries in the DB or insert a new entry
if it doesn't exist.
Hi,
I want to pull data from about 1500 remote Oracle tables with Spark, and I
want a multi-threaded application that picks up a table per thread, or maybe
10 tables per thread, and launches a Spark job to read from their respective
tables.
I read the official Spark site
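The pattern I have in mind looks roughly like this (the connection details,
table list, and output paths are all placeholders):

    import java.util.concurrent.Executors
    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Spark's scheduler is thread-safe: jobs submitted from different threads
    // run concurrently, subject to available executor resources.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

    val tables: Seq[String] = Seq("T1", "T2") // placeholder for ~1500 names

    val jobs: Seq[Future[Unit]] = tables.map { table =>
      Future {
        spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//host:1521/service") // placeholder
          .option("dbtable", table)
          .option("user", "scott")       // placeholder credentials
          .option("password", "tiger")
          .load()
          .write
          .parquet(s"/staging/$table")   // placeholder destination
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)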
Use foreachPartition(), get a connection from a JDBC connection pool,
and insert the data the same way you would in a non-Spark program.
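Roughly like this, assuming a DataFrame df with two string columns (the pool
object and table are placeholders, not Spark APIs):

    import java.sql.Connection

    // Hypothetical pool, e.g. backed by HikariCP; shown only as a placeholder.
    object ConnectionPool { def take(): Connection = ??? }

    df.rdd.foreachPartition { rows =>
      val conn = ConnectionPool.take() // one connection per partition, not per row
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO events (id, payload) VALUES (?, ?)") // illustrative table
        rows.foreach { row =>
          stmt.setString(1, row.getString(0))
          stmt.setString(2, row.getString(1))
          stmt.addBatch()
        }
        stmt.executeBatch() // one round trip per partition batch
      } finally conn.close()
    }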
If you're only doing inserts, Postgres COPY will be faster (e.g.
https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're
doing updates that's not
Hi all,
Is there any implementation of cosine similarity that supports Java?
Thanks,
Donni
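One option is MLlib's RowMatrix.columnSimilarities(), which computes pairwise
cosine similarities between columns and is callable from Java as well; a quick
sketch (in Scala):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 1.0, 1.0)
    ))

    // Cosine similarity between every pair of columns, as a CoordinateMatrix.
    val sims = new RowMatrix(rows).columnSimilarities()
    sims.entries.collect().foreach(println)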
Hi dear Spark community!
I want to create a lib which generates features for potentially very
large datasets, so I believe Spark could be a nice tool for that.
Let me explain what I need to do:
Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (