Re: flatMap() returning large class

2017-12-14 Thread Richard Garris
Hi Don, Good to hear from you. I think the problem is that, regardless of whether you use yield or a generator, Spark will internally produce the entire result as a single large JVM object, which will blow up your heap space. Would it be possible to shrink the overall size of the image object

kinesis throughput problems

2017-12-14 Thread Jeremy Kelley
We have a largish Kinesis stream with about 25k events per second, and each record is around 142k. I have tried multiple cluster sizes, multiple batch sizes, multiple parameters... I am doing minimal transformations on the data. Whatever happens I can sustain consuming 25k with minimal
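For reference, a minimal consumer sketch assuming the stream is read with the spark-streaming-kinesis-asl receiver; the application name, stream name, endpoint, region and batch interval below are placeholders rather than values from this thread:

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

object KinesisConsumerSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kinesis-sketch")
    val ssc = new StreamingContext(conf, Seconds(10)) // batch interval to tune

    // One KCL-backed receiver; to add receiver parallelism, create several of
    // these streams (at most one per shard) and union them.
    val stream = KinesisUtils.createStream(
      ssc,
      "kinesis-sketch-app",                          // KCL application name
      "my-stream",                                   // Kinesis stream name
      "https://kinesis.us-east-1.amazonaws.com",     // endpoint
      "us-east-1",                                   // region
      InitialPositionInStream.LATEST,
      Seconds(10),                                   // Kinesis checkpoint interval
      StorageLevel.MEMORY_AND_DISK_2)

    // Minimal transformation: just count records per batch.
    stream.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}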

Re: Feature generation / aggregate functions / timeseries

2017-12-14 Thread Georg Heiler
Also, the RDD StatCounter will already compute most of your desired metrics, as will df.describe: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html Georg Heiler wrote on Thu, 14 Dec 2017 at 19:40: > Look at
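A minimal sketch of the two built-ins mentioned above, df.describe and the RDD StatCounter; the column names and sample data are made up:

import org.apache.spark.sql.SparkSession

object DescribeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("describe-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, 2.0), (1, 3.5), (2, 7.1)).toDF("id", "value")

    // DataFrame.describe computes count, mean, stddev, min and max per column.
    df.describe("value").show()

    // The RDD StatCounter gives the same kind of summary for an RDD[Double].
    val stats = df.select($"value").as[Double].rdd.stats()
    println(s"mean=${stats.mean}, stdev=${stats.stdev}, max=${stats.max}")

    spark.stop()
  }
}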

Re: Feature generation / aggregate functions / timeseries

2017-12-14 Thread Georg Heiler
Look at custom UDAF (user-defined aggregate) functions. wrote on Thu, 14 Dec 2017 at 09:31: > Hi dear Spark community! > > I want to create a lib that generates features for potentially very > large datasets, so I believe Spark could be a nice tool for that. > Let me explain what I need to do
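A minimal sketch of a custom UDAF using the Spark 2.x UserDefinedAggregateFunction API; the aggregate shown (value range per group) is only an illustrative stand-in for whatever per-group features are actually needed:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Computes max(value) - min(value) per group.
class RangeUDAF extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  override def bufferSchema: StructType =
    StructType(StructField("min", DoubleType) :: StructField("max", DoubleType) :: Nil)
  override def dataType: DataType = DoubleType
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = Double.MaxValue
    buffer(1) = Double.MinValue
  }

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      val v = input.getDouble(0)
      buffer(0) = math.min(buffer.getDouble(0), v)
      buffer(1) = math.max(buffer.getDouble(1), v)
    }
  }

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = math.min(buffer1.getDouble(0), buffer2.getDouble(0))
    buffer1(1) = math.max(buffer1.getDouble(1), buffer2.getDouble(1))
  }

  override def evaluate(buffer: Row): Double = buffer.getDouble(1) - buffer.getDouble(0)
}

It would be applied as something like df.groupBy("id").agg(new RangeUDAF()(col("value"))).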

Re: flatMap() returning large class

2017-12-14 Thread Marcelo Vanzin
This sounds like something mapPartitions should be able to do; I'm not sure if there's an easier way. On Thu, Dec 14, 2017 at 10:20 AM, Don Drake wrote: > I'm looking for some advice when I have a flatMap on a Dataset that is > creating and returning a sequence of a new case
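A minimal sketch of the mapPartitions approach suggested here, with hypothetical InputRecord and BigDataStructure case classes standing in for the real types; returning a lazy Iterator means the expanded records are produced one at a time rather than being materialized as one large Seq per input record:

import org.apache.spark.sql.SparkSession

// Hypothetical stand-ins for the types discussed in this thread.
case class InputRecord(path: String)
case class BigDataStructure(bytes: Array[Byte])

object FlatMapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("flatmap-sketch").getOrCreate()
    import spark.implicits._

    val inputs = spark.read.textFile("/tmp/paths.txt").map(InputRecord(_))

    val expanded = inputs.mapPartitions { iter =>
      iter.flatMap { rec =>
        // Placeholder expansion: the real code would decode an image (or similar)
        // into several large records, ideally yielding them lazily like this.
        Iterator.fill(3)(BigDataStructure(Array.fill(1024)(0.toByte)))
      }
    }

    expanded.write.parquet("/tmp/expanded")
    spark.stop()
  }
}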

flatMap() returning large class

2017-12-14 Thread Don Drake
I'm looking for some advice when I have a flatMap on a Dataset that is creating and returning a sequence of a new case class (Seq[BigDataStructure]) that contains a very large amount of data, much larger than the single input record (think images). In Python, you can use generators (yield) to

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread salemi
Thank you for your response. In the case of an update, we sometimes need to just update a record, and in other cases we need to update the existing record and insert a new record. The statement you proposed doesn't handle that.

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
Modern versions of Postgres have upsert, i.e. INSERT INTO ... ON CONFLICT ... DO UPDATE. On Thu, Dec 14, 2017 at 11:26 AM, salemi wrote: > Thank you for your response. > The approach loads just the data into the DB. I am looking for an approach > that allows me to update

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread salemi
Thank you for your response. The approach just loads the data into the DB. I am looking for an approach that allows me to update existing entries in the DB or insert a new entry if it doesn't exist.

Spark multithreaded job submission from driver

2017-12-14 Thread Michael Artz
Hi, I want to pull data from about 1500 remote Oracle tables with Spark, and I want to have a multi-threaded application that picks up a table per thread, or maybe 10 tables per thread, and launches a Spark job to read from their respective tables. I read the official Spark site
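A minimal sketch of that design with placeholder connection details and table names: a bounded thread pool on the driver submits one JDBC read per table, and Spark's thread-safe scheduler runs the resulting jobs concurrently:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration
import org.apache.spark.sql.SparkSession

object ParallelJdbcReads {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("parallel-jdbc").getOrCreate()

    // Bounded pool: at most 10 table reads are in flight at a time.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(10))

    val tables = Seq("SCHEMA.TABLE_A", "SCHEMA.TABLE_B") // ... up to all ~1500 tables

    val jobs = tables.map { table =>
      Future {
        // Each thread launches its own Spark job for one table.
        spark.read
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
          .option("dbtable", table)
          .option("user", "app_user")
          .option("password", "secret")
          .load()
          .write
          .parquet(s"/data/raw/$table")
      }
    }

    Await.result(Future.sequence(jobs), Duration.Inf)
    spark.stop()
  }
}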

Re: bulk upsert data batch from Kafka dstream into Postgres db

2017-12-14 Thread Cody Koeninger
Use foreachPartition(), get a connection from a JDBC connection pool, and insert the data the same way you would in a non-Spark program. If you're only doing inserts, Postgres COPY will be faster (e.g. https://discuss.pivotal.io/hc/en-us/articles/204237003), but if you're doing updates that's not
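A minimal sketch of this pattern with made-up table, column and connection details; with a Kafka DStream the same body would sit inside stream.foreachRDD { rdd => rdd.foreachPartition { ... } }, and a real job would take connections from a pool such as HikariCP rather than DriverManager:

import java.sql.DriverManager
import org.apache.spark.sql.SparkSession

object UpsertSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("upsert-sketch").getOrCreate()

    val updates = spark.sparkContext.parallelize(Seq((1L, "a"), (2L, "b")))

    updates.foreachPartition { rows =>
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://dbhost:5432/app", "app_user", "secret")
      // Postgres upsert: insert, or update the existing row on key conflict.
      val stmt = conn.prepareStatement(
        "INSERT INTO events (id, payload) VALUES (?, ?) " +
          "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload")
      try {
        rows.foreach { case (id, payload) =>
          stmt.setLong(1, id)
          stmt.setString(2, payload)
          stmt.addBatch()
        }
        stmt.executeBatch()
      } finally {
        stmt.close()
        conn.close()
      }
    }
    spark.stop()
  }
}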

cosine similarity implementation in Java Spark

2017-12-14 Thread Donni Khan
Hi all, Is there any implementation of cosine similarity that supports Java? Thanks, Donni
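One option that is callable from Java (it is a plain JVM API in spark-mllib) is RowMatrix.columnSimilarities(), which computes pairwise cosine similarities between the columns of the matrix; a minimal Scala sketch with made-up vectors, offered as a pointer rather than as an answer given in this thread:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

object CosineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cosine-sketch").getOrCreate()

    // Each row is an observation; columnSimilarities() compares the columns.
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(1.0, 0.0, 2.0),
      Vectors.dense(0.0, 3.0, 4.0)))

    val similarities = new RowMatrix(rows).columnSimilarities()
    similarities.entries.collect().foreach(println)

    spark.stop()
  }
}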

Feature generation / aggregate functions / timeseries

2017-12-14 Thread julio . cesare
Hi dear Spark community! I want to create a lib that generates features for potentially very large datasets, so I believe Spark could be a nice tool for that. Let me explain what I need to do: each file 'F' of my dataset is composed of at least: - an id ( string or int ) - a timestamp (
