Figured it out. I ran sbt/sbt assembly on the same jobserver branch and
am running that as a standalone Spark cluster. I am then running a
separate jobserver from the same branch - it all works now.
Ognen
On 2/24/14, 9:02 AM, Ognen Duzlevski wrote:
In any case,
I am running the same version
d.run(ForkJoinWorkerThread.java:107)
On 2/23/14, 10:46 PM, Ognen Duzlevski wrote:
Nick, thanks.
I checked out the code and after briefly reading the docs and playing
with it, I have a basic question :) - I have a standalone spark
cluster running on the same machine. I am guessing in order for all th
quite soon though.
Haven't upgraded mine to 0.9.0 yet though so I may also run into the same
issues.
Feel free to ping me or Evan who wrote the job server PR with questions.
On Sun, Feb 23, 2014 at 10:49 PM, Ognen Duzlevski wrote:
On Feb 23, 2014 at 10:15 PM, Ognen Duzlevski <og...@nengoiksvelzud.com> wrote:
Hello all,
perhaps too ambitiously ;) I have decided to try and roll my own
Scalatra app that connects to a spark cluster and executes a query when
a certain URL is accessed - I am just trying to figure out how to get
these things going.
Is there anything I should pay particular attention to?
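Since the thread never shows the app itself, here is a minimal sketch of the idea (not Ognen's actual code): a Scalatra servlet holding one long-lived SparkContext and running a job when a URL is hit. The master URL, data path, and route are hypothetical placeholders.

    import org.apache.spark.SparkContext
    import org.scalatra.ScalatraServlet

    class QueryServlet extends ScalatraServlet {
      // One SparkContext per JVM; creating one per request does not work.
      lazy val sc = new SparkContext("spark://master-host:7077", "scalatra-query")

      get("/count") {
        // Hypothetical job: count lines containing a date passed as ?date=...
        val date = params.getOrElse("date", "2014-01-01")
        val n = sc.textFile("hdfs://master-host:9000/events.log")
          .filter(_.contains(date))
          .count()
        "matching lines: " + n
      }
    }

One design point worth noting for the standalone-cluster question above: the servlet's JVM becomes the Spark driver, so it must be reachable from the workers and carry the same Spark build on its classpath, which is consistent with the fix reported at the top of this thread.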
Sorry, I meant the jobserver-preview branch.
Ognen
On 2/17/14, 10:31 PM, Ognen Duzlevski wrote:
Mayur,
On 2/17/14, 7:37 PM, Mayur Rustagi wrote:
Check out jobserver on Spark. Haven't used it myself but would love to
know your feedback around it.
Thanks. I am not sure what you are referring to? This?
https://github.com/ooyala/incubator-spark
Ognen
Hello,
I am constructing a data pipeline which relies on Spark. As part of this
exercise, I would like to provide my users a web based interface to
submit queries that would translate into Spark jobs. So, something like
"user enters dates and other criteria for e.g. retention analysis" =>
thi
On 2/11/14, 12:57 PM, Mayur Rustagi wrote:
I have found this quite useful
http://www.youtube.com/watch?v=49Hr5xZyTEA
Thank you!
Ognen
Eran,
How much do you know in general about Map/Reduce? Do you know anything?
If not, this is a VERY nice explanation of what MR is about at a very
high conceptual level:
http://ksat.me/map-reduce-a-really-simple-introduction-kloudo/
However, going from the above link to implementing somethi
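As a concrete anchor for that link, the canonical Map/Reduce example (word count) might look like the hedged sketch below in Spark's Scala API; the master URL and input path are placeholders.

    import org.apache.spark.SparkContext

    // Word count: the canonical Map/Reduce example.
    val sc = new SparkContext("local[4]", "wordcount-sketch")
    val counts = sc.textFile("hdfs://master:9000/input.txt")
      .flatMap(_.split("\\s+"))  // "map" side: one token per word
      .map(word => (word, 1))
      .reduceByKey(_ + _)        // "reduce" side: sum counts per word
    counts.take(10).foreach(println)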
I have run Spark jobs on multiple 20GB+ files (groupByKey() on filtered
contents of these files) via s3n:// and it all worked. Well, if you
consider taking forever to read in 20GB worth of a file over a network
connection (which is the limiting factor in this scenario) as "worked".
I quickly reali
gInfo(e._2) ?
>
> 2014/1/24 Ognen Duzlevski
>
>> I can confirm that there is something seriously wrong with this.
>>
>> If I run the spark-shell with local[4] on the same cluster and run the
>> same task on the same hdfs:// files I get an output like
= 14137
This is just crazy.
Ognen
On Fri, Jan 24, 2014 at 1:39 PM, Ognen Duzlevski <og...@plainvanillagames.com> wrote:
> Thanks.
>
> This is a VERY simple example.
>
> I have two 20 GB json files. Each line in the files has the same format.
> I run: val events = filt
you code? Such as addi() for
> DoubleMatrix. This kind of operation will affect the original data.
>
> 2. You could try to use the Spark replay debugger; there is an assert function.
> Hope that's helpful.
> http://spark-replay-debugger-overview.readthedocs.org/en/latest/
If so, the program loses idempotence. You
> should specify a seed for it.
>
> 2014/1/24 Ognen Duzlevski
Hello,
(Sorry for the sensationalist title) :)
If I run Spark on files from S3 and do a basic transformation like:
textFile()
filter
groupByKey
count
I get one number (e.g. 40,000).
If I do the same on the same files from HDFS, the number spat out is
completely different (VERY different - someth
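For reference, a hedged sketch of the four-step pipeline being compared (not the actual job; the paths, filter predicate, and key function are placeholders):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("spark://master:7077", "s3-vs-hdfs")

    def countGroups(path: String): Long =
      sc.textFile(path)
        .filter(_.nonEmpty)                       // placeholder filter
        .map(line => (line.split(",")(0), line))  // key by first field
        .groupByKey()
        .count()                                  // number of distinct keys

    val fromS3   = countGroups("s3n://my-bucket/events.json")
    val fromHdfs = countGroups("hdfs://master:9000/events.json")

If the two counts differ, the first things to rule out are an incomplete copy into HDFS and non-deterministic logic between the two runs (e.g. an unseeded random sample), which is what the seed advice above is getting at.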
Manoj,
large is a relative term ;)
NFS is a rather slow solution, at least that's always been my experience.
However, it will work for smaller files.
One way to do it is to put the files in S3 on Amazon. However, then your
network becomes a limiting factor.
The other way to do it is to replicat
Hello,
I have found that you generally need two separate pools of knowledge to be
successful in this game :). One is to have enough knowledge of network
topologies, systems, Java, Scala and whatever else to actually set up the
whole system (esp. if your requirements are different than running on a
On Tue, Jan 21, 2014 at 10:37 PM, Andrew Ash wrote:
> Documentation suggestion:
>
> Default number of tasks to use *across the cluster* for distributed
> shuffle operations (groupByKey, reduceByKey,
> etc) when not set by user.
>
> Ognen would that have clarified for you?
>
Of course :)
Thanks!
This is what docs/configuration.md says about the property:
"Default number of tasks to use for distributed shuffle operations
(groupByKey, reduceByKey, etc) when not set by user."
If I set this property to, let's say, 4 - what does this mean? 4 tasks per
core, per worker, per...? :)
Thanks
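Per Andrew's clarification above, the value is a cluster-wide total. A hedged sketch of setting it with the 0.9-style SparkConf (the master URL and app name are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("parallelism-demo")
      // 4 shuffle tasks across the whole cluster, not per core or per worker.
      .set("spark.default.parallelism", "4")

    val sc = new SparkContext(conf)

Individual operations can still override it per call, e.g. reduceByKey(_ + _, 16).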
What I did on the VPC in Amazon is allow all outgoing traffic from it to
the world and allow all traffic to flow freely on ingress WITHIN my
subnet (so, for example, if your subnet is 10.10.0.0/24, you allow all
machines on the network to connect to each other on any port, but that's
about all yo
On Mon, Jan 20, 2014 at 11:05 PM, Ognen Duzlevski wrote:
>
> Thanks. I will try that but your assumption is that something is failing
> in an obvious way with a message. By the look of the spark-shell - just
> frozen - I would say something is "stuck". Will report back.
Jey,
On Mon, Jan 20, 2014 at 10:59 PM, Jey Kottalam wrote:
> >> This sounds like either a bug or somehow the S3 library requiring lots
> >> of memory to read a block. There isn’t a separate way to run HDFS over
> >> S3. Hadoop just has different implementations of “file systems”, one of
> >> whic
Hi Matei, thanks for replying!
On Mon, Jan 20, 2014 at 8:08 PM, Matei Zaharia wrote:
> It’s true that the documentation is partly targeting Hadoop users, and
> that’s something we need to fix. Perhaps the best solution would be some
> kind of tutorial on “here’s how to set up Spark by hand on EC2
On Sun, Jan 19, 2014 at 4:56 PM, Mayur Rustagi wrote:
> Here is what I would suggest. In order to protect from human error, start an
> EC2 instance with the ec2 script. Copy over the folders as they are well
> integrated with hdfs, compatible drivers and versions. Change configuration
> to configure s
On Sun, Jan 19, 2014 at 2:49 PM, Ognen Duzlevski
wrote:
>
> My basic requirement is to set everything up myself and understand it. For
> testing purposes my cluster has 15 xlarge instances and I guess I will just
> set up a hadoop cluster to run over these instances for the purposes
and the scala syntax is pretty unintuitive for
>> me.
>>
>> I would be more than happy to assist and work on documentation with you.
>> Do you have any ideas on how you want to go about it or some existing plans?
>>
>> -- Ankur
>>
>> > On Jan 19, 2014,
Hello,
I have been trying to set up a running spark cluster for a while now. Being
new to all this, I have tried to rely on the documentation; however, I find
it sorely lacking on a few fronts.
For example, I think it has a number of built-in assumptions about a
person's knowledge of Hadoop or Mesos
Hello,
I am trying to read in a 20GB file from an S3 bucket. I have verified I can
read small files from my cluster. The cluster itself has 15 slaves and a
master; each slave has 16GB of RAM, and the machines are Amazon m1.xlarge
instances.
All I am doing is below; however, a minute into execution I g
Figured out my own problem - it was a routing issue in the VPC. Sorry for
the noise.
On Sat, Jan 18, 2014 at 7:37 PM, Ognen Duzlevski <og...@plainvanillagames.com> wrote:
> On Sat, Jan 18, 2014 at 2:09 PM, Ognen Duzlevski wrote:
>
>> Also, could this have to do with the
On Sat, Jan 18, 2014 at 2:09 PM, Ognen Duzlevski wrote:
> Also, could this have to do with the fact that there is a "/" in the path
> to the S3 resource?
>
Nope, the / has nothing to do with it, verified with a different file in
the same bucket.
Ognen
Also, could this have to do with the fact that there is a "/" in the path
to the S3 resource?
Thanks!
Ognen
I am trying to run a simple job on a 3-machine Spark cluster. All three
machines are Amazon instances within the VPC (xlarge size instances). All I
am doing is reading a (rather large - about 20GB) file from an S3 bucket
and doing some basic filtering on each line.
Here I start the spark shell:
s
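The shell command itself is cut off above; a hedged sketch of the kind of session meant (the bucket name, credentials handling, and filter are placeholders - in 0.8.x-era Spark the S3 keys could also go into the Hadoop configuration instead of the URL):

    // Inside spark-shell, where sc is provided.
    val lines = sc.textFile("s3n://my-bucket/big-file.json")
    val filtered = lines.filter(_.contains("2014-01-18"))  // placeholder predicate
    println(filtered.count())

With a 20GB object behind s3n://, the network read itself is usually the bottleneck, as noted elsewhere in this thread.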
shared file system, then each worker
> should read a subset of the files in the directory by accessing them locally.
> Nothing should be read on the master.
>
> TD
>
> On Wed, Jan 15, 2014 at 3:56 PM, Ognen Duzlevski wrote:
On a cluster where the nodes and the master all have access to a shared
filesystem/files - does spark read a file (like one resulting from
sc.textFile()) in parallel/different sections on each node? Or is the file
read on the master in sequence and chunks processed on the nodes afterwards?
Thanks!
Ognen
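To make TD's answer above concrete, a hedged sketch (the mount path and split count are made up): with the same filesystem mounted on every node, each worker reads its own splits, and the master only schedules tasks.

    // Hypothetical path, visible at the same mount point on all nodes.
    val rdd = sc.textFile("/shared/data/events.log", 64)  // request 64 splits
    // Each split is read locally by whichever worker runs its task;
    // the driver never pulls the whole file through itself.
    println(rdd.count())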
Hello,
I am new to spark and have a few questions that are fairly general in
nature:
I am trying to set up a real-time data analysis pipeline where I have
clients sending events to a collection point (load balanced), and onward,
the "collectors" send the data to a Spark cluster via zeromq pub/sub (
amples/ZeroMQWordCount.scala.
>
> On Mon, Dec 30, 2013 at 9:41 PM, Ognen Duzlevski wrote:
Can anyone provide any code examples of connecting Spark to zeromq data
producers for purposes of simple real-time analytics? Even the most basic
example would be nice :)
Thanks!
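No example ever landed inline in the thread, so here is a hedged sketch along the lines of the ZeroMQWordCount example referenced above, against the 0.9 streaming API; the publisher URL and topic are placeholders.

    import akka.util.ByteString
    import akka.zeromq.Subscribe
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.zeromq.ZeroMQUtils

    object ZeroMQSketch {
      def main(args: Array[String]) {
        val ssc = new StreamingContext("spark://master:7077", "zmq-analytics", Seconds(2))

        // Turn each multi-part ZeroMQ message into lines of text.
        def bytesToLines(bytes: Seq[ByteString]): Iterator[String] =
          bytes.map(_.utf8String).iterator

        val lines = ZeroMQUtils.createStream(ssc, "tcp://collector:5556",
          Subscribe("events"), bytesToLines _)

        // Basic real-time analytics: per-batch word counts.
        lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }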
On Mon, Dec 23, 2013 at 2:42 PM, Ognen Duzlevski wrote:
Hello,
On Mon, Dec 23, 2013 at 3:23 PM, Jie Deng wrote:
> I am using Java, and Spark has APIs for Java as well. Though there is a
> saying that Java in Spark is slower than the Scala shell; well, it depends
> on your requirements.
> I am not an expert in Spark, but as far as I know, Spark provides differ
Hello, I am new to Spark and have installed it, played with it a bit;
mostly I am reading through the "Fast data processing with Spark" book.
One of the first things I realized is that I have to learn Scala - the
real-time data analytics part is not supported by the Python API, correct?
I don't min