Re: Creating a Spark context from a Scalatra servlet

2014-02-24 Thread Ognen Duzlevski
Figured it out. I did sbt/sbt assembly on the same jobserver branch and am running that as a standalone spark cluster. I am then running a separate jobserver from the same branch - it all works now. Ognen On 2/24/14, 9:02 AM, Ognen Duzlevski wrote: In any case, I am running the same version

Re: Creating a Spark context from a Scalatra servlet

2014-02-24 Thread Ognen Duzlevski
d.run(ForkJoinWorkerThread.java:107) On 2/23/14, 10:46 PM, Ognen Duzlevski wrote: Nick, thanks. I checked out the code and after briefly reading the docs and playing with it, I have a basic question :) - I have a standalone spark cluster running on the same machine. I am guessing in order for all th

Re: Creating a Spark context from a Scalatra servlet

2014-02-23 Thread Ognen Duzlevski
quite soon though. Haven't upgraded mine to 0.9.0 yet though so I may also run into same issues. Feel free to ping me or Evan who wrote the job server PR with questions. — Sent from Mailbox <https://www.dropbox.com/mailbox> for iPhone On Sun, Feb 23, 2014 at 10:49 PM, Ognen Duzl

Re: Creating a Spark context from a Scalatra servlet

2014-02-23 Thread Ognen Duzlevski
Feb 23, 2014 at 10:15 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote: On 2/23/14, 10:26 AM, Ognen Duzlevski wrote: Hello all, perhaps too ambitiously ;) I have decided to try and roll my own Scalatra app that connects to a spark cluster and executes a query wh

Re: Creating a Spark context from a Scalatra servlet

2014-02-23 Thread Ognen Duzlevski
On 2/23/14, 10:26 AM, Ognen Duzlevski wrote: Hello all, perhaps too ambitiously ;) I have decided to try and roll my own Scalatra app that connects to a spark cluster and executes a query when a certain URL is accessed - I am just trying to figure out how to get these things going. Is

Creating a Spark context from a Scalatra servlet

2014-02-23 Thread Ognen Duzlevski
Hello all, perhaps too ambitiously ;) I have decided to try and roll my own Scalatra app that connects to a spark cluster and executes a query when a certain URL is accessed - I am just trying to figure out how to get these things going. Is there anything I should pay particular attention to
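The setup described here — a servlet that owns a long-lived SparkContext and runs a job when a URL is hit — can be sketched roughly as below. This is a minimal sketch only, assuming a 0.9-era Spark API and Scalatra 2.x; the master URL, HDFS path, and filter predicate are hypothetical placeholders, not code from the thread.

```scala
import org.apache.spark.SparkContext
import org.scalatra.ScalatraServlet

class QueryServlet extends ScalatraServlet {
  // Only one SparkContext may be active per JVM, so the servlet holds a
  // single lazily created context rather than creating one per request.
  lazy val sc = new SparkContext("spark://master:7077", "scalatra-query-app")

  get("/count") {
    // Hypothetical query: count lines containing "event" in an HDFS file.
    val n = sc.textFile("hdfs://namenode:9000/data/events.json")
      .filter(_.contains("event"))
      .count()
    s"matched lines: $n\n"
  }
}
```

As the later messages in this thread conclude, an alternative to embedding the context in your own servlet is to build the jobserver branch and submit jobs to it over REST.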

Re: RESTful API for Spark

2014-02-17 Thread Ognen Duzlevski
Sorry, I meant the jobserver-preview branch. Ognen On 2/17/14, 10:31 PM, Ognen Duzlevski wrote: Mayur, On 2/17/14, 7:37 PM, Mayur Rustagi wrote: Check out jobserver on Spark. Haven't used it myself but would love to know your feedback around it. Thanks. I am not sure what you are refe

Re: RESTful API for Spark

2014-02-17 Thread Ognen Duzlevski
Mayur, On 2/17/14, 7:37 PM, Mayur Rustagi wrote: Check out jobserver on Spark. Haven't used it myself but would love to know your feedback around it. Thanks. I am not sure what you are referring to? This? https://github.com/ooyala/incubator-spark Ognen

RESTful API for Spark

2014-02-17 Thread Ognen Duzlevski
Hello, I am constructing a data pipeline which relies on Spark. As part of this exercise, I would like to provide my users a web based interface to submit queries that would translate into Spark jobs. So, something like "user enters dates and other criteria for e.g. retention analysis" => thi

Re: Query execution in spark

2014-02-11 Thread Ognen Duzlevski
On 2/11/14, 12:57 PM, Mayur Rustagi wrote: I have found this quite useful http://www.youtube.com/watch?v=49Hr5xZyTEA Thank you! Ognen

Re: Looking for resources on Map\Reduce concepts

2014-02-04 Thread Ognen Duzlevski
Eran, How much do you know in general about Map/Reduce? Do you know anything? If not, this is a VERY nice explanation of what MR is about at a very high conceptual level: http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/ However, going from the above link to implementing somethi

Re: s3n > 5GB

2014-01-26 Thread Ognen Duzlevski
I have run Spark jobs on multiple 20GB+ files (groupByKey() on filtered contents of these files) via s3n:// and it all worked. Well, if you consider taking forever to read in 20GB worth of a file over a network connection (which is the limiting factor in this scenario) as "worked". I quickly reali

Re: Non-deterministic behavior in spark

2014-01-24 Thread Ognen Duzlevski
gInfo(e._2) ? > > > 2014/1/24 Ognen Duzlevski > >> I can confirm that there is something seriously wrong with this. >> >> If I run the spark-shell with local[4] on the same cluster and run the >> same task on the same hdfs:// files I get an output like >>

Re: Non-deterministic behavior in spark

2014-01-24 Thread Ognen Duzlevski
= 14137 This is just crazy. Ognen On Fri, Jan 24, 2014 at 1:39 PM, Ognen Duzlevski < og...@plainvanillagames.com> wrote: > Thanks. > > This is a VERY simple example. > > I have two 20 GB json files. Each line in the files has the same format. > I run: val events = filt

Re: Non-deterministic behavior in spark

2014-01-24 Thread Ognen Duzlevski
you code? Such as addi() for > DoubleMatrix. This kind of operation will affect the original data. > > 2. You could try to use the Spark replay debugger; there is an assert function. > Hope that's helpful. > http://spark-replay-debugger-overview.readthedocs.org/en/latest/ > > > 201

Re: Non-deterministic behavior in spark

2014-01-24 Thread Ognen Duzlevski
If so, the program lost the idempotent feature. You > should specify a seed to it. > > > 2014/1/24 Ognen Duzlevski > >> Hello, >> >> (Sorry for the sensationalist title) :) >> >> If I run Spark on files from S3 and do basic transformation like: >

Non-deterministic behavior in spark

2014-01-24 Thread Ognen Duzlevski
Hello, (Sorry for the sensationalist title) :) If I run Spark on files from S3 and do basic transformation like: textfile() filter groupByKey count I get one number (e.g. 40,000). If I do the same on the same files from HDFS, the number spat out is completely different (VERY different - someth
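The pipeline described (textFile → filter → groupByKey → count) looks roughly like this in the 0.9-era API. The bucket, credentials, filter predicate, and key-extraction step below are placeholder assumptions, not the original code. The relevant point is that this chain is deterministic for a fixed input, so differing counts between s3n:// and hdfs:// copies of the same files point at the input side (e.g. an incomplete S3 read), not at the transformations.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._ // implicit pair-RDD ops (groupByKey)

val sc = new SparkContext("spark://master:7077", "count-check")

val count = sc.textFile("s3n://KEY:SECRET@bucket/events.json")
  .filter(_.contains("login"))          // hypothetical predicate
  .map(line => (line.take(10), line))   // hypothetical key: first 10 chars
  .groupByKey()
  .count()

// The count should be identical for identical input, wherever it is stored.
println(count)
```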

Re: How to use cluster for large set of linux files

2014-01-22 Thread Ognen Duzlevski
Manoj, large is a relative term ;) NFS is a rather slow solution, at least that's always been my experience. However, it will work for smaller files. One way to do it is to put the files in S3 on Amazon. However, then your network becomes a limiting factor. The other way to do it is to replicat

Re: Running K-Means on a cluster setup

2014-01-22 Thread Ognen Duzlevski
s.com > https://twitter.com/mayur_rustagi > > > > On Wed, Jan 22, 2014 at 8:20 PM, Ognen Duzlevski > wrote: > >> Hello, >> >> I have found that you generally need two separate pools of knowledge to >> be successful in this game :). One is to have enough knowl

Re: Running K-Means on a cluster setup

2014-01-22 Thread Ognen Duzlevski
Hello, I have found that you generally need two separate pools of knowledge to be successful in this game :). One is to have enough knowledge of network topologies, systems, java, scala and whatever else to actually set up the whole system (esp. if your requirements are different than running on a

Re: spark.default.parallelism

2014-01-21 Thread Ognen Duzlevski
On Tue, Jan 21, 2014 at 10:37 PM, Andrew Ash wrote: > Documentation suggestion: > > Default number of tasks to use *across the cluster* for distributed > shuffle operations (groupByKey, reduceByKey, > etc) when not set by user. > > Ognen would that have clarified for you? > Of course :) Thanks!

spark.default.parallelism

2014-01-21 Thread Ognen Duzlevski
This is what docs/configuration.md says about the property: "Default number of tasks to use for distributed shuffle operations (groupByKey, reduceByKey, etc) when not set by user." If I set this property to, let's say, 4 - what does this mean? 4 tasks per core, per worker, per...? :) Thanks
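Per the clarification earlier in the thread, the value is the total number of shuffle tasks across the cluster, not per core or per worker. A minimal sketch of setting it in the 0.9-era API, where configuration was passed as Java system properties before the SparkContext was constructed (the master URL and pipeline below are hypothetical):

```scala
// Spark 0.9-era: configuration is read from Java system properties at
// SparkContext creation time, so the property must be set first.
// "4" here means 4 shuffle tasks in total across the cluster.
System.setProperty("spark.default.parallelism", "4")

// Hypothetical usage — any groupByKey/reduceByKey without an explicit
// numPartitions argument would then run with 4 reduce tasks:
// val sc = new SparkContext("spark://master:7077", "app")
// sc.textFile("hdfs:///data").map(w => (w, 1)).reduceByKey(_ + _)
```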

Re: Spark on private network

2014-01-21 Thread Ognen Duzlevski
What I did on the VPC in Amazon is allow all outgoing traffic from it to the world and allow all traffic to flow freely on the ingress WITHIN my subnet (so for example, if your subnet is 10.10.0.0/24, you allow all machines on the network to connect to each other on any port but that's about all yo

Re: Quality of documentation (rant)

2014-01-21 Thread Ognen Duzlevski
On Mon, Jan 20, 2014 at 11:05 PM, Ognen Duzlevski wrote: > > Thanks. I will try that but your assumption is that something is failing > in an obvious way with a message. By the look of the spark-shell - just > frozen I would say something is "stuck". Will report back.

Re: Quality of documentation (rant)

2014-01-20 Thread Ognen Duzlevski
Jey, On Mon, Jan 20, 2014 at 10:59 PM, Jey Kottalam wrote: > >> This sounds like either a bug or somehow the S3 library requiring lots > of > >> memory to read a block. There isn’t a separate way to run HDFS over S3. > >> Hadoop just has different implementations of “file systems”, one of > whic

Re: Quality of documentation (rant)

2014-01-20 Thread Ognen Duzlevski
Hi Matei, thanks for replying! On Mon, Jan 20, 2014 at 8:08 PM, Matei Zaharia wrote: > It’s true that the documentation is partly targeting Hadoop users, and > that’s something we need to fix. Perhaps the best solution would be some > kind of tutorial on “here’s how to set up Spark by hand on EC2

Re: Quality of documentation (rant)

2014-01-19 Thread Ognen Duzlevski
On Sun, Jan 19, 2014 at 4:56 PM, Mayur Rustagi wrote: > Here's what I would suggest. In order to protect from human error, start a > ec2 instance with the ec2 script. Copy over the folders as they are well > integrated with hdfs, compatible drivers and versions. Change configuration > to configure s

Re: Quality of documentation (rant)

2014-01-19 Thread Ognen Duzlevski
On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski wrote: > > My basic requirement is to set everything up myself and understand it. For > testing purposes my cluster has 15 xlarge instances and I guess I will just > set up a hadoop cluster to run over these instances for the purposes

Re: Quality of documentation (rant)

2014-01-19 Thread Ognen Duzlevski
and the scala syntax is pretty unintuitive for >> me. >> >> I would be more than happy to assist and work on documentation with you. >> Do you have any ideas on how you want to go about it or some existing plans? >> >> -- Ankur >> >> > On Jan 19, 2014,

Quality of documentation (rant)

2014-01-19 Thread Ognen Duzlevski
Hello, I have been trying to set up a running spark cluster for a while now. Being new to all this, I have tried to rely on the documentation, however, I find it sorely lacking on a few fronts. For example, I think it has a number of built-in assumptions about a person's knowledge of Hadoop or Me

Problems with Spark (Lost executor error)

2014-01-19 Thread Ognen Duzlevski
Hello, I am trying to read in a 20GB file from an S3 bucket. I have verified I can read small files from my cluster. The cluster itself has 15 slaves and a master, each slave has 16GB of RAM, the machines are Amazon m1.xlarge instances. All I am doing is below, however a minute into execution I g

Re: Spark job failing on cluster

2014-01-18 Thread Ognen Duzlevski
Figured out my own problem - it was a routing issue in the VPC. Sorry for the noise. On Sat, Jan 18, 2014 at 7:37 PM, Ognen Duzlevski < og...@plainvanillagames.com> wrote: > On Sat, Jan 18, 2014 at 2:09 PM, Ognen Duzlevski > wrote: > >> Also, could this have to do with the

Re: Spark job failing on cluster

2014-01-18 Thread Ognen Duzlevski
On Sat, Jan 18, 2014 at 2:09 PM, Ognen Duzlevski wrote: > Also, could this have to do with the fact that there is a "/" in the path > to the S3 resource? > Nope, the / has nothing to do with it, verified with a different file in the same bucket. Ognen

Re: Spark job failing on cluster

2014-01-18 Thread Ognen Duzlevski
Also, could this have to do with the fact that there is a "/" in the path to the S3 resource? Thanks! Ognen On Sat, Jan 18, 2014 at 1:49 PM, Ognen Duzlevski wrote: > I am trying to run a simple job on a 3-machine Spark cluster. All three > machines are Amazon instances within

Spark job failing on cluster

2014-01-18 Thread Ognen Duzlevski
I am trying to run a simple job on a 3-machine Spark cluster. All three machines are Amazon instances within the VPC (xlarge size instances). All I am doing is reading a (rather large - about 20GB) file from an S3 bucket and doing some basic filtering on each line. Here I start the spark shell: s
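The job in question is, roughly, the following spark-shell session (0.9-era; `sc` is the context the shell provides, and the bucket, credentials, and predicate are placeholders). Failures often surface only at the action, since `textFile` is lazy — and as the follow-ups show, the cause here was VPC routing rather than the code.

```scala
// Inside spark-shell, which supplies `sc` already bound to the cluster.
val f = sc.textFile("s3n://KEY:SECRET@my-bucket/big-file.json")

// Hypothetical basic filtering on each line:
val bad = f.filter(line => !line.contains("timestamp"))

// Forces the actual S3 read; errors appear here, not at textFile().
bad.count()
```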

Re: Reading files on a cluster / shared file system

2014-01-16 Thread Ognen Duzlevski
shared file system, then each worker > should read a subset of the files in directory by accessing them locally. > Nothing should be read on the master. > > TD > > > On Wed, Jan 15, 2014 at 3:56 PM, Ognen Duzlevski > wrote: > >> On a cluster where the nodes and the ma

Reading files on a cluster / shared file system

2014-01-15 Thread Ognen Duzlevski
On a cluster where the nodes and the master all have access to a shared filesystem/files - does spark read a file (like one resulting from sc.textFile()) in parallel/different sections on each node? Or is the file read on master in sequence and chunks processed on the nodes afterwards? Thanks! Ogn
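As the reply above explains, the file is not funneled through the master: `textFile` produces a partitioned RDD, and each partition is read locally by whichever worker runs the corresponding task. A sketch from the spark-shell (0.9-era `minSplits` parameter; the path and split count are placeholders):

```scala
// Ask for at least 64 partitions; each partition is read by the worker
// that executes its task, in parallel — not sequentially on the master.
val rdd = sc.textFile("/shared/data/events.log", minSplits = 64)

// Number of partitions actually created (at least 64 for a large file).
println(rdd.partitions.length)
```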

General Spark question (streaming)

2014-01-09 Thread Ognen Duzlevski
Hello, I am new to spark and have a few questions that are fairly general in nature: I am trying to set up a real-time data analysis pipeline where I have clients sending events to a collection point (load balanced) and onward the "collectors" send the data to a Spark cluster via zeromq pub/sub (

Re: Noob Spark questions

2013-12-31 Thread Ognen Duzlevski
amples/ZeroMQWordCount.scala > . > > > On Mon, Dec 30, 2013 at 9:41 PM, Ognen Duzlevski > wrote: > >> Can anyone provide any code examples of connecting Spark to zeromq data >> producers for purposes of simple real-time analytics? Even the most basic >> example woul

Re: Noob Spark questions

2013-12-30 Thread Ognen Duzlevski
Can anyone provide any code examples of connecting Spark to zeromq data producers for purposes of simple real-time analytics? Even the most basic example would be nice :) Thanks! On Mon, Dec 23, 2013 at 2:42 PM, Ognen Duzlevski wrote: > Hello, I am new to Spark and have installed it, pla

Re: Noob Spark questions

2013-12-23 Thread Ognen Duzlevski
Hello, On Mon, Dec 23, 2013 at 3:23 PM, Jie Deng wrote: > I am using Java, and Spark has APIs for Java as well. Though there is a > saying that Java in Spark is slower than Scala shell, well, depends on your > requirement. > I am not an expert in Spark, but as far as I know, Spark provide differ

Noob Spark questions

2013-12-23 Thread Ognen Duzlevski
Hello, I am new to Spark and have installed it, played with it a bit, mostly I am reading through the "Fast data processing with Spark" book. One of the first things I realized is that I have to learn Scala, the real-time data analytics part is not supported by the Python API, correct? I don't min