Cascading Spark Structured streams

2017-12-28 Thread Eric Dain
I need to write a Spark Structured Streaming pipeline that involves multiple aggregations, splitting data into multiple sub-pipes, and unioning them. It also needs stateful aggregation with timeout. Spark Structured Streaming supports all of the required functionality, but not as one stream. I

Re: Standalone Cluster: ClassNotFound org.apache.kafka.common.serialization.ByteArrayDeserializer

2017-12-28 Thread Shixiong(Ryan) Zhu
The cluster mode doesn't upload jars to the driver node. This is a known issue: https://issues.apache.org/jira/browse/SPARK-4160 On Wed, Dec 27, 2017 at 1:27 AM, Geoff Von Allmen wrote: > I’ve tried it both ways. > > Uber jar gives me the following: > >-

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Gourav Sengupta
Hi Jeroen, can you try using EMR version 5.10 or EMR version 5.11 instead? Can you please try selecting a subnet which is in a different availability zone? If possible, just try to increase the number of task instances and see the difference. Also, in case you are using caching,

Re: Custom Data Source for getting data from Rest based services

2017-12-28 Thread vaish02
We extensively use PubMed and clinical trial databases for our work, and it involves making a large number of parametric REST API queries. Usually, if the data download is large, the requests time out and we have to run queries in very small batches. We also extensively use a large number (thousands)

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:42, Gourav Sengupta wrote: > In the EMR cluster what are the other applications that you have enabled > (like HIVE, FLUME, Livy, etc). Nothing that I can think of, just a Spark step (unless EMR is doing fancy stuff behind my back). > Are you

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:40, Maximiliano Felice wrote: > I experienced a similar issue a few weeks ago. The situation was a result of > a mix of speculative execution and OOM issues in the container. Interesting! However I don't have any OOM exception in the logs.

Fwd: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 19:25, Patrick Alwell wrote: > You are using groupByKey() have you thought of an alternative like > aggregateByKey() or combineByKey() to reduce shuffling? I am aware of this indeed. I do have a groupByKey() that is difficult to avoid, but the

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Gourav Sengupta
Hi Jeroen, Can I get a few pieces of additional information please? In the EMR cluster what are the other applications that you have enabled (like HIVE, FLUME, Livy, etc). Are you using SparkSession? If yes, is your application using cluster mode or client mode? Have you read the EC2 service

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Maximiliano Felice
Hi Jeroen, I experienced a similar issue a few weeks ago. The situation was a result of a mix of speculative execution and OOM issues in the container. First of all, when an executor takes too much time in Spark, it is handled by the YARN speculative execution, which will launch a new executor

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Patrick Alwell
Jeroen, Anytime there is a shuffle in the network, Spark moves to a new stage. It seems like you are having issues either pre- or post-shuffle. Have you looked at a resource management tool like Ganglia to determine if this is a memory- or thread-related issue? The Spark UI? You are using

Re: Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
On 28 Dec 2017, at 17:41, Richard Qiao wrote: > Are you able to specify which path of data filled up? I can narrow it down to a bunch of files but it's not so straightforward. > Any logs not rolled over? I have to manually terminate the cluster but there is nothing

Spark on EMR suddenly stalling

2017-12-28 Thread Jeroen Miller
Dear Sparkers, Once again in times of desperation, I leave what remains of my mental sanity to this wise and knowledgeable community. I have a Spark job (on EMR 5.8.0) which had been running daily for months, if not the whole year, with absolutely no supervision. This changed all of a sudden

Pyspark and searching items from data structures

2017-12-28 Thread Esa Heikkinen
Hi, I would like to build a PySpark application that searches for sequential items or events in time series from CSV files. What are the best data structures for this purpose? A PySpark or pandas DataFrame, an RDD, SQL, or something else? --- Esa

Re: Reading data from OpenTSDB or KairosDB

2017-12-28 Thread marko
Hello, Thanks for your answer. And what do you think about the approach of querying data from OpenTSDB/KairosDB piece by piece, creating a DataFrame for each piece, and then making a union out of them? This would enable us to store and query data as time series and process it using Spark?