Re: Receiver and Parallelization

2015-09-25 Thread Adrian Tanase
1) yes, just use .repartition on the inbound stream, this will shuffle data across your whole cluster and process in parallel as specified. 2) yes, although I’m not sure how to do it for a totally custom receiver. Does this help as a starting point?
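
A minimal sketch of point 1, assuming ssc is the application's StreamingContext and 16 is only an example partition count:

    // any single-receiver stream; socketTextStream is just an example source
    val lines = ssc.socketTextStream("localhost", 9999)
    // shuffle incoming batches across 16 partitions so downstream work runs on
    // the whole cluster instead of only the receiver's node
    val distributed = lines.repartition(16)
    distributed.map(_.toUpperCase).print()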

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906068#comment-14906068 ] Adrian Tanase commented on SPARK-10792: --- https://issues.apache.org/jira/browse/SPARK-8297 seems

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-10792: -- Description: We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s

Re: Custom Hadoop InputSplit, Spark partitions, spark executors/task and Yarn containers

2015-09-24 Thread Adrian Tanase
RE: # because I already have a bunch of InputSplits, do I still need to specify the number of executors to get processing parallelized? I would say it’s best practice to have as many executors as data nodes and as many cores as you can get from the cluster – if YARN has enough resources it
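
For reference, a hedged example of how that sizing could be expressed in code – the numbers are placeholders, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.instances", "8")  // e.g. roughly one executor per data node
      .set("spark.executor.cores", "4")      // as many cores as YARN can grant per executor
      .set("spark.executor.memory", "4g")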

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-10792: -- Priority: Major (was: Minor) > Spark streaming + YARN – executor is not re-created on mach

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-10792: -- Priority: Minor (was: Major) > Spark streaming + YARN – executor is not re-created on mach

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906133#comment-14906133 ] Adrian Tanase commented on SPARK-10792: --- Correct - I forgot to attach a screenshot where

[jira] [Updated] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adrian Tanase updated SPARK-10792: -- Attachment: Screen Shot 2015-09-21 at 1.58.28 PM.png > Spark streaming + YARN – execu

[jira] [Commented] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-10792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14906109#comment-14906109 ] Adrian Tanase commented on SPARK-10792: --- Yarn side or Spark side? If it does, shouldn't that also

Re: Spark on YARN / aws - executor lost on node restart

2015-09-24 Thread Adrian Tanase
Closing the loop, I’ve submitted this issue – TD, cc-ing you since it’s spark streaming, not sure who oversees the Yarn module. https://issues.apache.org/jira/browse/SPARK-10792 -adrian From: Adrian Tanase Date: Friday, September 18, 2015 at 6:18 PM To: "user@spark.apache.org<mai

[jira] [Created] (SPARK-10792) Spark streaming + YARN – executor is not re-created on machine restart

2015-09-24 Thread Adrian Tanase (JIRA)
Adrian Tanase created SPARK-10792: - Summary: Spark streaming + YARN – executor is not re-created on machine restart Key: SPARK-10792 URL: https://issues.apache.org/jira/browse/SPARK-10792 Project

Re: How to make Group By/reduceByKey more efficient?

2015-09-24 Thread Adrian Tanase
All the *ByKey aggregations perform an efficient shuffle and preserve partitioning on the output. If all you need is to call reduceByKey, then don’t bother with groupBy. You should use groupBy if you really need all the datapoints from a key for a very custom operation. From the docs: Note:
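
A small sketch of the difference (toy data, illustrative only):

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // reduceByKey combines values map-side before the shuffle, so only partial
    // sums cross the network
    val sums = pairs.reduceByKey(_ + _)
    // groupByKey ships every value for a key; reserve it for truly custom
    // operations that need all the data points
    val grouped = pairs.groupByKey().mapValues(_.toList)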

Re: Scala Limitation - Case Class definition with more than 22 arguments

2015-09-24 Thread Adrian Tanase
+1 on grouping the case classes and creating a hierarchy – as long as you use the data programmatically. For DataFrames / SQL the other ideas probably scale better… From: Ted Yu Date: Wednesday, September 23, 2015 at 7:07 AM To: satish chandra j Cc: user Subject: Re: Scala Limitation - Case

Re: reduceByKeyAndWindow confusion

2015-09-24 Thread Adrian Tanase
Let me take a stab at your questions – can you clarify some of the points below? I'm wondering if you're using the streaming concepts as they were intended… 1. Windowed operations First, I just want to confirm that it is your intention to split the original kafka stream into multiple DStreams

Re: reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread Adrian Tanase
The 2 operations can't be used inside one another. If you need something like an all-time average then you need to keep a tuple (sum, count) to which you add all the new values that come in every batch. The average is then just a map on the state DStream. Makes sense? Have I guessed your use
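
A minimal sketch of the (sum, count) approach, assuming a DStream of (key, value) pairs named pairs and a checkpointed StreamingContext:

    // state per key: running (sum, count)
    def updateFn(newValues: Seq[Double], state: Option[(Double, Long)]): Option[(Double, Long)] = {
      val (sum, count) = state.getOrElse((0.0, 0L))
      Some((sum + newValues.sum, count + newValues.size))
    }

    val state = pairs.updateStateByKey(updateFn)                          // DStream[(K, (sum, count))]
    val averages = state.mapValues { case (sum, count) => sum / count }   // the all-time average per key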

Re: Deploying spark-streaming application on production

2015-09-22 Thread Adrian Tanase
btw I re-read the docs and I want to clarify that reliable receiver + WAL gives you at least once, not exactly once semantics. Sent from my iPhone On 21 Sep 2015, at 21:50, Adrian Tanase <atan...@adobe.com<mailto:atan...@adobe.com>> wrote: I'm wondering, isn't this the canoni

Re: Spark Streaming distributed job

2015-09-22 Thread Adrian Tanase
I think you need to dig into the custom receiver implementation. As long as the source is distributed and partitioned, the downstream .map, .foreachXX are all distributed as you would expect. You could look at how the “classic” Kafka receiver is instantiated in the streaming guide and try to
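
For comparison, this is roughly how the "classic" receiver-based Kafka stream from the streaming guide is instantiated – the ZooKeeper address, consumer group and topic map are assumptions:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Map("events" -> 4) requests 4 receiver threads for the "events" topic
    val kafkaStream = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("events" -> 4))
    kafkaStream.map(_._2).print()   // downstream map/print are distributed, as with any DStream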

Re: Invalid checkpoint url

2015-09-22 Thread Adrian Tanase
Have you tried simply ssc.checkpoint("checkpoint")? This should create it in the local folder, has always worked for me when in development on local mode. For the others (/tmp/..) make sure you have rights to write there. -adrian From: srungarapu vamsi Date: Tuesday, September 22, 2015 at 7:59
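
For illustration, both variants side by side (the HDFS path is only an example):

    ssc.checkpoint("checkpoint")                              // relative path, fine for local-mode development
    ssc.checkpoint("hdfs:///user/spark/checkpoints/my-app")   // a typical choice on a real cluster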

Re: Remove duplicate keys by always choosing first in file.

2015-09-22 Thread Adrian Tanase
tion is, how does one sort lines in a file by line number. On Tue, Sep 22, 2015 at 6:15 AM, Adrian Tanase <atan...@adobe.com<mailto:atan...@adobe.com>> wrote: By looking through the docs and source code, I think you can get away with rdd.zipWithIndex to get the index of each line in
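
A sketch of that idea, keeping the first occurrence of each key in original line order – extractKey is a hypothetical stand-in for however the key is derived from a line:

    def extractKey(line: String): String = line.split(",")(0)     // stand-in: first CSV column

    val indexed = sc.textFile("input.txt").zipWithIndex()          // (line, originalLineNumber)
    val byKey = indexed.map { case (line, idx) => (extractKey(line), (idx, line)) }
    // keep the value with the smallest line number per key, i.e. the first one in the file
    val firstPerKey = byKey.reduceByKey((a, b) => if (a._1 < b._1) a else b).mapValues(_._2)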

Re: Using Spark for portfolio manager app

2015-09-21 Thread Adrian Tanase
1. reading from kafka has exactly once guarantees - we are using it in production today (with the direct receiver) * you will probably have 2 topics, loading both into spark and joining / unioning as needed is not an issue * tons of optimizations you can do there, assuming
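
A hedged sketch of reading the two topics with the direct API (topic and broker names are assumptions):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val trades = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("user-trades"))
    val quotes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("stock-quotes"))
    val combined = trades.union(quotes)   // or keep them separate and join by key downstream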

Re: Spark Streaming and Kafka MultiNode Setup - Data Locality

2015-09-21 Thread Adrian Tanase
We do - using Spark streaming, Kafka, HDFS all collocated on the same nodes. Works great so far. Spark picks up the location information and reads data from the partitions hosted by the local broker, showing up as NODE_LOCAL in the UI. You also need to look at the locality options in the

Re: What's the best practice to parse JSON using spark

2015-09-21 Thread Adrian Tanase
I've been using spray-json for general JSON ser/deser in scala (spark app), mostly for config files and data exchange. Haven't used it in conjunction with jobs that process large JSON data sources, so can't speak for those use cases. -adrian
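
A small spray-json example of the config-file use case (the Config case class is hypothetical):

    import spray.json._
    import DefaultJsonProtocol._

    case class Config(host: String, port: Int)
    implicit val configFormat = jsonFormat2(Config)

    val cfg = """{"host": "localhost", "port": 8080}""".parseJson.convertTo[Config]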

Re: Deploying spark-streaming application on production

2015-09-21 Thread Adrian Tanase
I'm wondering, isn't this the canonical use case for WAL + reliable receiver? As far as I know you can tune the Mqtt server to wait for ack on messages (qos level 2?). With some support from the client library you could achieve exactly once semantics on the read side, if you ack the message only after

Re: Limiting number of cores per job in multi-threaded driver.

2015-09-18 Thread Adrian Tanase
Reading through the docs it seems that with a combination of FAIR scheduler and maybe pools you can get pretty far. However the smallest unit of scheduled work is the task so probably you need to think about the parallelism of each transformation. I'm guessing that by increasing the level of
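
A sketch of the FAIR scheduler plus pools setup (the allocation file path and pool name are assumptions):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .set("spark.scheduler.mode", "FAIR")
      .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    val sc = new SparkContext(conf)

    // each driver thread picks the pool its jobs should run in
    sc.setLocalProperty("spark.scheduler.pool", "low-priority")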

Re: Limiting number of cores per job in multi-threaded driver.

2015-09-18 Thread Adrian Tanase
Forgot to mention that you could also restrict the parallelism to 4, essentially using only 4 cores at any given time, however if your job is complex, a stage might be broken into more than 1 task... Sent from my iPhone On 19 Sep 2015, at 08:30, Adrian Tanase <atan...@adobe.com<mailt

Re: Using Spark for portfolio manager app

2015-09-18 Thread Adrian Tanase
Cool use case! You should definitely be able to model it with Spark. For the first question it's pretty easy - you probably need to keep the user portfolios as state using updateStateByKey. You need to consume 2 event sources - user trades and stock changes. You probably want to Cogroup the
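
A minimal sketch of cogrouping the two event sources by symbol and folding them into per-symbol state – the case classes and the semantics (position plus last price) are assumptions:

    import org.apache.spark.streaming.dstream.DStream

    case class Trade(symbol: String, qty: Int)
    case class Quote(symbol: String, price: Double)

    def positions(trades: DStream[Trade], quotes: DStream[Quote]): DStream[(String, (Int, Double))] = {
      val tradesBySymbol = trades.map(t => (t.symbol, t.qty))
      val quotesBySymbol = quotes.map(q => (q.symbol, q.price))
      tradesBySymbol.cogroup(quotesBySymbol).updateStateByKey[(Int, Double)] {
        (events: Seq[(Iterable[Int], Iterable[Double])], state: Option[(Int, Double)]) =>
          val (position, lastPrice) = state.getOrElse((0, 0.0))
          val newPosition = position + events.flatMap(_._1).sum        // apply all trades this batch
          val newPrice = events.flatMap(_._2).lastOption.getOrElse(lastPrice)  // keep latest quote
          Some((newPosition, newPrice))
      }
    }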

Re: Spark on YARN / aws - executor lost on node restart

2015-09-18 Thread Adrian Tanase
dies completely? If there are no ideas on the list, I’ll prepare some logs and follow up with an issue. Thanks, -adrian From: Adrian Tanase Date: Wednesday, September 16, 2015 at 6:01 PM To: "user@spark.apache.org<mailto:user@spark.apache.org>" Subject: Spark on YARN / aws

Re: Spark Streaming application code change and stateful transformations

2015-09-17 Thread Adrian Tanase
This section in the streaming guide also outlines a new option – use 2 versions in parallel for a period of time, controlling the draining / transition in the application level. http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code Also – I would not

Re: Saprk.frame.Akkasize

2015-09-17 Thread Adrian Tanase
Have you reviewed this section of the guide? http://spark.apache.org/docs/latest/programming-guide.html#shared-variables If the dataset is static and you need a copy on all the nodes, you should look at broadcast variables. SQL specific, have you tried loading the dataset using the DataFrame
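
A minimal broadcast-variable sketch (the lookup table and RDD are toy data):

    // the lookup table is shipped to each executor once instead of with every task
    val rates = sc.broadcast(Map("EUR" -> 1.08, "GBP" -> 1.27))
    val amounts = sc.parallelize(Seq(("EUR", 100.0), ("GBP", 50.0)))
    val inUsd = amounts.map { case (ccy, amt) => amt * rates.value.getOrElse(ccy, 1.0) }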

Re: Input parsing time

2015-09-17 Thread Adrian Tanase
You’re right – everything is captured under Executor Computing Time if it’s your app code. I know that some people have used custom builds of spark that add more timers – they will show up nicely in the Spark UI. A more lightweight approach is to time it yourself via some counters /
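
One way to do the lightweight timing with an accumulator – the parsing step below is only a stand-in:

    val parseNanos = sc.accumulator(0L, "parse time (ns)")

    val parsed = sc.textFile("hdfs:///data/input").map { line =>
      val start = System.nanoTime()
      val fields = line.split(',')              // stand-in for the real parsing logic
      parseNanos += System.nanoTime() - start
      fields
    }
    parsed.count()                              // after an action, inspect parseNanos.value on the driver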

Re: spark performance - executor computing time

2015-09-17 Thread Adrian Tanase
Something similar happened to our job as well - spark streaming, YARN deployed on AWS. One of the jobs was consistently taking 10–15X longer on one machine. Same data volume, data partitioned really well, etc. Are you running on AWS or on prem? We were assuming that one of the VMs in Amazon

Re: How to recovery DStream from checkpoint directory?

2015-09-17 Thread Adrian Tanase
This section in the streaming guide makes your options pretty clear http://spark.apache.org/docs/latest/streaming-programming-guide.html#upgrading-application-code 1. Use 2 versions in parallel, drain the queue up to a point and start fresh in the new version, only processing events from

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
for debugging purposes. But I have to use foreachRDD so that I can operate on top of this RDD and eventually save to DB. But my actual problem here is to properly convert Array[Byte] to my custom object. On Thu, Sep 17, 2015 at 7:04 PM, Adrian Tanase <atan...@adobe.com<mailto:atan...@ado

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
Why are you calling foreachRdd / collect in the first place? Instead of using a custom decoder, you should simply do – this is code executed on the workers and allows the computation to continue. ForeachRdd and collect are output operations and force the data to be collected on the driver

Re: Spark Streaming kafka directStream value decoder issue

2015-09-17 Thread Adrian Tanase
if i did not get what you are suggesting me to try. On Thu, Sep 17, 2015 at 9:30 PM, Adrian Tanase <atan...@adobe.com<mailto:atan...@adobe.com>> wrote: I guess what I'm asking is why not start with a Byte array like in the example that works (using the DefaultDecoder) then map o
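
A sketch of the suggestion: read raw bytes with the DefaultDecoder, then convert to the custom object with a plain map, which runs on the executors – MyEvent, the byte-to-object conversion and the topic are placeholders:

    import kafka.serializer.{DefaultDecoder, StringDecoder}
    import org.apache.spark.streaming.kafka.KafkaUtils

    case class MyEvent(payload: String)
    def deserialize(bytes: Array[Byte]): MyEvent = MyEvent(new String(bytes, "UTF-8"))  // stand-in decoder

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val raw = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
      ssc, kafkaParams, Set("events"))
    val objects = raw.map { case (_, bytes) => deserialize(bytes) }          // decoding stays distributed
    objects.foreachRDD(rdd => rdd.foreachPartition(_.foreach(println)))      // replace println with the DB write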

Spark on YARN / aws - executor lost on node restart

2015-09-16 Thread Adrian Tanase
Hi all, We’re using spark streaming (1.4.0), deployed on AWS through yarn. It’s a stateful app that reads from kafka (with the new direct API) and we’re checkpointing to HDFS. During some resilience testing, we restarted one of the machines and brought it back online. During the offline

Re: Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Adrian Tanase
If you don't need the counts in between the DB writes, you could simply use a 5 min window for the updateStateByKey and use foreachRdd on the resulting DStream. Even simpler, you could use reduceByKeyAndWindow directly. Lastly, you could keep a variable on the driver and check if 5 minutes have
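
A hedged sketch of the reduceByKeyAndWindow variant, assuming a DStream of (key, count) pairs named events:

    import org.apache.spark.streaming.Minutes

    // tumbling 5-minute window: counts are aggregated and emitted once per window
    val windowed = events.reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(5), Minutes(5))
    windowed.foreachRDD { rdd =>
      rdd.foreachPartition(_.foreach(println))   // replace println with the actual DB write
    }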

Re: Suggested Method for Execution of Periodic Actions

2015-09-16 Thread Adrian Tanase
<yuzhih...@gmail.com<mailto:yuzhih...@gmail.com>> wrote: bq. and check if 5 minutes have passed What if the duration for the window is longer than 5 minutes ? Cheers On Wed, Sep 16, 2015 at 1:25 PM, Adrian Tanase <atan...@adobe.com<mailto:atan...@adobe.com>>

[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint

2015-05-26 Thread Adrian Tanase (JIRA)
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559071#comment-14559071 ] Adrian Tanase commented on SPARK-5206: -- [~tdas] - are there any plans to make

[Bug 1163010] [NEW] package samba4 4.0.0~alpha18.dfsg1-4ubuntu2 failed to install/upgrade: subprocess installed post-installation script returned error exit status 126

2013-04-01 Thread Adrian Tanase
Public bug reported: the package failed to install ProblemType: Package DistroRelease: Ubuntu 12.04 Package: samba4 4.0.0~alpha18.dfsg1-4ubuntu2 ProcVersionSignature: Ubuntu 3.2.0-31.50-generic 3.2.28 Uname: Linux 3.2.0-31-generic i686 NonfreeKernelModules: wl ApportVersion: 2.0.1-0ubuntu13
