Hi,
We have quite a long-winded Spark application we inherited, with many stages.
When we run on our spark cluster, things start off well enough. Workers are
busy, lots of progress made, etc. etc. However, 30 minutes into processing, we
see CPU usage of the workers drop drastically. At this
The default is 10 MB. It depends on the memory available, and what the network
transfer effects are going to be. You can set spark.sql.autoBroadcastJoinThreshold
to raise the threshold in the case of Spark SQL. But you definitely shouldn't be
broadcasting gigabytes.
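A minimal sketch of bumping it (assuming a Spark 2.x SparkSession named spark; the value is in bytes, and -1 disables broadcast joins entirely):
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100 * 1024 * 1024).toString)  // ~100 MB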
Hi,
I'm looking to have spark jobs access S3 with temporary credentials. I've seen
some examples around AssumeRole, but I have a scenario where the temp
credentials are provided by GetFederationToken. Is there anything that can
help, or do I need to use boto to execute GetFederationToken, and
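If it helps, one approach sketch (untested here): fetch the token out of band with boto or the AWS SDK, then hand the three values to Hadoop's s3a support, which understands session tokens since Hadoop 2.8:
val hc = sc.hadoopConfiguration
hc.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hc.set("fs.s3a.access.key", tempAccessKey)       // hypothetical vals holding the
hc.set("fs.s3a.secret.key", tempSecretKey)       // GetFederationToken response
hc.set("fs.s3a.session.token", tempSessionToken)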
I'm trying to easily create custom encoders for case classes having
"unfriendly" fields. I could just kryo the whole thing, but would like to at
least have a few fields in the schema instead of one binary blob. For example,
case class MyClass(id: UUID, items: Map[String, Double], name: String)
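One workaround sketch (an assumption about intent, not a proper custom encoder): shadow the class with an encoder-friendly twin, carrying the UUID as a String so most fields stay in the schema:
import org.apache.spark.sql.Encoders

case class MyClassRepr(id: String, items: Map[String, Double], name: String)

// data: Seq[MyClass] built on the driver (hypothetical)
val ds = spark.createDataset(
  data.map(c => MyClassRepr(c.id.toString, c.items, c.name))
)(Encoders.product[MyClassRepr])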
Consider a data source that has data in 500mb files, and doesn't support
predicate push down. Spark will have to load all the data into memory before it
can apply filtering, select "columns" etc. Each 500mb file will at some point
have to be loaded entirely in memory. Now consider a data source
I'm using Spark Streaming to process a large number of files (10s of millions)
from a single directory in S3. Using sparkContext.textFile or wholeTextFiles
takes ages and doesn't do anything. Pointing Structured Streaming to that
location seems to work, but after processing all the input, it
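For what it's worth, a sketch of the Structured Streaming route (bucket and prefix hypothetical); the file source's maxFilesPerTrigger option caps how many files each micro-batch picks up:
val files = spark.readStream
  .option("maxFilesPerTrigger", "10000")
  .text("s3a://bucket/prefix/")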
Guess the good people in the Cassandra world are stuck in the past making
indexes, materialized views, etc. better with every release :)
From: mich.talebza...@gmail.com
Date: Mon, 15 Aug 2016 11:11:03 +0100
Subject: Re: Does Spark SQL support indexes?
To: gourav.sengu...@gmail.com
CC:
k/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2478
Regards,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Wed, Aug 10, 2016 at 10:24 AM, Ashic
) and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()
And then join..
On Aug 8, 2016, at 8:17 PM,
Hi,
Is there a way to simulate "networked" Spark when running local (i.e.
master=local[4])? Ideally, some setting that'll ensure any "Task not
serializable" errors are caught during local testing? I seem to vaguely
remember something, but am having trouble pinpointing it.
Cheers,
Ashic.
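One thing that may be what you're remembering (an assumption - it's a semi-documented testing master): local-cluster, which spawns separate executor JVMs so closures genuinely get serialized:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local-cluster[2,1,1024]")  // 2 workers, 1 core and 1024 MB each
  .setAppName("serialization-smoke-test")
val sc = new SparkContext(conf)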
emory on the driver, increase your memory.
Speaking of which, a filtering step might also help on the above, i.e., filter
the bigRDD with the keys of the Map before joining.
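A sketch of that filter (names hypothetical: smallMap is the driver-side Map, bigRdd a pair RDD keyed the same way):
val keys = sc.broadcast(smallMap.keySet)
val trimmed = bigRdd.filter { case (k, _) => keys.value.contains(k) }
// join trimmed, not bigRdd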
Hope this helps,
Anastasios
On Tue, Aug 9, 2016 at 4:46 PM, Ashic Mahtab <as...@live.com> wrote:
Hi Sam,
Yup
the results. It should
not take more than 40 mins in a 32 GB RAM system with 6 core processors.
Gourav
On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab <as...@live.com> wrote:
Hi Mich,
Hardware: AWS EMR cluster with 15 r3.2xlarge nodes (CPU, RAM
fine, disk a couple of hundred gig).
On 9 August 2016 at 15:46, Ashic Mahtab <as...@live.com> wrote:
Hi Sam,
Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's
no progress. The Spark UI doesn't even show up
; user@spark.apache.org
Have you tried to broadcast your small table in order to perform your
join?
joined = bigDF.join(broadcast(smallDF), Seq("id"))
On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab <as...@live.com> wrote:
Hi Deepak,
No... not really. Upping the disk size is a solution, bu
like
what helped in this scenario.
Thanks
Deepak
On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab <as...@live.com> wrote:
Hi Deepak,
Thanks for the response.
Registering the temp tables didn't help. Here's what I have:
val a = sqlContext.read.parquet(...).select("eid.
g 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org
Register your dataframes as temp tables and then try the join on the temp
table. This should resolve your issue.
Thanks
Deepak
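A sketch of that suggestion against the schemas quoted below (Spark 1.x API; table names arbitrary):
a.registerTempTable("a_tbl")
b.registerTempTable("b_tbl")
val joined = sqlContext.sql(
  """SELECT b.id, b.Number, a.Name
     FROM a_tbl a RIGHT OUTER JOIN b_tbl b ON a.id = b.id""")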
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <as...@
Hello,
We have two parquet inputs of the following form:
a: id: String, Name: String (1.5 TB)
b: id: String, Number: Int (1.3 GB)
We need to join these two to get (id, Number, Name). We've tried two approaches:
a.join(b, Seq("id"), "right_outer")
where a and b are dataframes. We also tried taking the
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3
Sorry to do this...but... *bump*
From:
as...@live.com
To: user@spark.apache.org
access key id and secret access key when you
initially configured it.
Is your s3 bucket without any access restrictions?
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04
Sorry to do this...but... *bump*
From: as...@live.com
To: user@spark.apache.org
Subject: Cluster mode deployment from jar in S3
Date: Fri, 1 Jul 2016 17:45:12 +0100
Hello,
I've got a Spark stand-alone cluster using EC2 instances. I can submit
jobs using "--deploy-mode client", however using "--deploy-mode cluster" is
proving to be a challenge. I've tried this:
spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
Spark will execute as a client for HDFS. In other words, it'll contact the
HDFS NameNode, which will return the block locations, and then the data will
be fetched from the DataNodes.
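A minimal illustration (host and path hypothetical) - the driver only needs the NameNode address; block reads go straight to the DataNodes:
val rdd = sc.textFile("hdfs://namenode:8020/data/input.txt")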
Date: Tue, 19 Apr 2016 14:00:31 +0530
Subject: Spark + HDFS
From: chaturvedich...@gmail.com
To:
I spent ages on this recently, and here's what I found:
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties"
works. Alternatively, you can also do:
--conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=filename.properties"
@spark.apache.org
Hi Ashic,
Unfortunately I don't know how to work around that - I suggested this line as
it looked promising (I had considered it once before deciding to use a
different algorithm) but I never actually tried it.
Regards,
James
On 13 April 2016 at 02:29, Ashic Mahtab <as...@live.
ight work.
Regards,
James
On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote:
Hello,
I'm trying to save a pipeline with a random forest classifier. If I try
to save the pipeline, it complains that the classifier is not Writable, and
indeed the classifier itself doesn't have a
will need to write code in the
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James
On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote:
Hello,
I'm trying to save a pipeline with a random forest class
Hello,
I'm trying to save a pipeline with a random forest classifier. If I try
to save the pipeline, it complains that the classifier is not Writable, and
indeed the classifier itself doesn't have a write function. There's a pull
request that's been merged that enables this for Spark 2.0 (any
Spark may not be the right tool for this. Working on just the mobile device,
you won't be scaling out stuff, and as such most of the benefits of Spark would
be nullified. Moreover, it'd likely run slower than things that are meant to
work in a single process. Spark is also quite large, which is
Hello,
Is it possible to specify additional class paths / java options "in
addition to" those specified in spark-defaults.conf? I see that if I specify
spark.executor.extraJavaOptions or spark.executor.extraClassPath in defaults,
and then specify --conf
src/main/resources/log4j.properties
Subject: Re: log4j pains
From: st...@memeticlabs.org
Date: Thu, 10 Mar 2016 11:08:46 -0600
CC: user@spark.apache.org
To: as...@live.com
Where in the jar is the log4j.properties file?
On Mar 10, 2016, at 9:40 AM, Ashic Mahtab <as...@live.com> wrote:
1. F
Hello,
I'm trying to use a custom log4j appender, with things specified in a
log4j.properties file. Very little seems to work in this regard. Here's what
I've tried:
1. Fat jar with logging dependencies included. log4j.properties in fat jar.
Spark doesn't pick up the properties file, so uses its
Found it.
You can pass in the jvm parameter log4j.configuration. The following works:
-Dlog4j.configuration=file:path/to/log4j.properties
It doesn't work without the file: prefix though. Tested in 1.6.0.
Cheers,
Ashic.
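Putting that together with the executor-side flag quoted elsewhere in this thread, a full submit might look like this (class name and paths hypothetical):
spark-submit --class Foo \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:path/to/log4j.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties" \
  app.jar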
From: as...@live.com
To: user@spark.apache.org
Subject: Specify log4j
Hello,
Is it possible to provide a log4j properties file when submitting jobs to
a cluster? I know that by default spark looks for a log4j.properties file in
the conf directory. I'm looking for a way to specify a different
log4j.properties file (external to the application) without pointing to a
Hi Ewan,
Transforms are definitions of what needs to be done - they don't
execute until an action is triggered. For what you want, I think you might
need to have an action that writes out rdds to some sort of buffered writer.
-Ashic.
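A sketch of such an action (output path hypothetical; assumes an RDD[String]):
rdd.foreachPartition { lines =>
  val writer = new java.io.PrintWriter(new java.io.BufferedWriter(
    new java.io.FileWriter(s"/tmp/out-${java.util.UUID.randomUUID()}")))
  try lines.foreach(line => writer.println(line))
  finally writer.close()
}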
From: ewan.le...@realitymine.com
To: user@spark.apache.org
Hello,
I'm trying to work offline with spark-core. I've got an empty project
with the following:
name := "sbtSand"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"joda-time" % "joda-time" % "2.9.1",
"org.apache.spark" %% "spark-core" % "1.5.2"
)
I can "sbt
To answer my own question, it appears certain things (like parents, etc.)
caused the issue. I was using sbt 0.13.8. Using 0.13.9 works fine.
From: as...@live.com
To: user@spark.apache.org
Subject: Working offline with spark-core and sbt
Date: Thu, 31 Dec 2015 02:07:26 +
Hello,
I'm trying
When you say tasks, do you mean different applications, or different tasks in
the same application? If it's the same program, they should be able to share
the broadcasted value. But given you're asking the question, I imagine they're
separate.
And in that case, afaik, the answer is no. You
Hi Ayan,
How continuous is your workload? As Akhil points out, with streaming,
you'll give up at least one core for receiving, and will need at least one more core
for processing. Unless you're running on something like Mesos, this means that
those cores are dedicated to your app, and can't be
and that's where my concern is.
TIA
Ayan
On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab as...@live.com wrote:
Hi Ayan,
How continuous is your workload? As Akhil points out, with streaming,
you'll give up at least one core for receiving, and will need at least one more core
for processing. Unless you're
Unfortunately, afaik that project is long dead.
It'd be an interesting project to create an intermediary protocol, perhaps
using something that nearly everything these days understand (unfortunately [!]
that might be JavaScript). For example, instead of pickling language
constructs, it might be
Spark comes with quite a few components. At its core is... surprise... Spark
Core. This provides the core things required to run Spark jobs. Spark provides
a lot of operators out of the box...take a look at
?!?
From: guha.a...@gmail.com
To: as...@live.com
CC: user@spark.apache.org
It's a problem since 1.3 I think
On 26 Jun 2015 04:00, Ashic Mahtab as...@live.com wrote:
Hello,
Just trying out Spark 1.4 (we're using 1.1 at present). On Windows, I've
noticed the following:
* On 1.4, sc.textFile("D:\\folder
Date: Fri, 26 Jun 2015 08:54:31 +
On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com wrote:
Thanks for the replies, guys.
Is this a permanent change as of 1.3, or will it go away at some point?
Don't blame the spark team, complain to the hadoop team for being slow
Hello,
Just trying out Spark 1.4 (we're using 1.1 at present). On Windows, I've
noticed the following:
* On 1.4, sc.textFile("D:\\folder\\").collect() fails from both spark-shell.cmd
and when running a Scala application referencing the spark-core package from
maven.
Hi Francisco,
While I haven't tried this, have a look at the contents of
start-thriftserver.sh - all it's doing is setting up a few variables and
calling:
/bin/spark-submit --class
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
and passing some additional parameters. Perhaps doing the
Hi,
I've seen a few articles where they use CqlStorageHandler to create hive tables
referencing Cassandra data using the thriftserver. Is there a secret to getting
this to work? I've basically got Spark built with Hive, and a Cassandra
cluster. Is there a way to get the hive server to talk to
Two questions regarding worker cleanup:
1) Is the best place to enable worker cleanup setting
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30"
in conf/spark-env.sh for each worker? Or is
there a better place?
2) I see this has a default TTL of 7
Is there an easy way to check if a spark binary release was built with Hive
support? Are any of the prebuilt binaries on the spark website built with hive
support?
Thanks,
Ashic.
are built with -Phive except the 'without-hive' build.
On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab as...@live.com wrote:
Is there an easy way to check if a spark binary release was built with Hive
support? Are any of the prebuilt binaries on the spark website built with
hive support
You could look at using Cassandra for storage. Spark integrates nicely with
Cassandra, and a combination of Spark + Cassandra would give you fast access to
structured data in Cassandra, while enabling analytic scenarios via Spark.
Cassandra would take care of the replication, as it's one of the
the main script on sleep for some time (say 2 minutes).
Thanks
Best Regards
On Sat, Jan 24, 2015 at 1:57 AM, Ashic Mahtab as...@live.com wrote:
Hello,
I'm trying to kick off a spark streaming job to a stand alone master using
spark submit inside of init.d. This is what I have:
DAEMON=spark
Hello,
I'm trying to kick off a spark streaming job to a stand alone master using
spark submit inside of init.d. This is what I have:
DAEMON="spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar"
start() {
$DAEMON -p
setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some docs that would tell me if these are expected
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some docs that would tell me if these are expected
or not.
From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015
delay!! Can you tell me more about
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some
in 4.961 secs (median) to 106 msgs in 4.761
secs. I think there's evidence that setup costs are quite high in this case
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com
wrote:
Hi Ashic Mahtab,
The Cassandra
To: as...@live.com
CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com
Hi Ashic Mahtab,
Are Cassandra and Zookeeper installed as part of the YARN
architecture, or are they installed in a separate layer alongside Apache Spark?
Thanks and Regards,
Sudipta
On Thu, Jan 22
tathagata.das1...@gmail.com
wrote:
This is not normal. It's a huge scheduling delay!! Can you tell me more about
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump
) to 106 msgs in 4.761
secs. I think there's evidence that setup costs are quite high in this case
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com
wrote:
Hi Ashic Mahtab,
The Cassandra and the Zookeeper are they installed
Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's
features for analysis, rather than high throughput). Messages are coming in
throughout the day, at around 1-20 per second (finger in the air estimate...not
analysed yet). In the spark streaming UI for the
Hi,
For client mode spark submits of applications, we can do the following:
def createStreamingContext() = {
  ...
  val sc = new SparkContext(conf)
  // Create a StreamingContext with a 1 second batch size
  val ssc = new StreamingContext(sc, Seconds(1))
}
...
val ssc =
Hi,
Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate
4 cores to a streaming application. I can do this via spark submit by:
spark-submit --total-executor-cores 4
However, this assigns one core per machine. I would like to use 2 cores on 2
machines instead,
To: as...@live.com
CC: user@spark.apache.org
I think you want:
--num-executors 2 --executor-cores 2
On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to
dedicate 4 cores to a streaming application. I
-executors` is not available for standalone clusters. In
standalone mode, you must start new workers on your node as it is a
1:1 ratio of workers to executors.
On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote:
Hi Sean,
Thanks for the response.
It seems --num-executors
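One standalone-side knob that may do what the original question asked (an assumption on my part, not from this thread): the master's spark.deploy.spreadOut setting. With it set to false, the master packs an app onto as few workers as possible, so --total-executor-cores 4 on 2-core boxes lands as 2 cores on 2 machines:
# in conf/spark-defaults.conf on the master (assumed setup)
spark.deploy.spreadOut  false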
Hi,
Say we have an operation that writes something to an external resource and
gets some output. For example:
def doSomething(entry: SomeEntry, session: Session): SomeOutput = {
  val result = session.SomeOp(entry)
  SomeOutput(entry.Key, result.SomeProp)
}
I could use a transformation for
Thanks Sean. That's kind of what I figured. Luckily, for my use case writes are
idempotent, so map works.
From: so...@cloudera.com
Date: Fri, 19 Dec 2014 11:06:51 +
Subject: Re: How to run an action and get output?
To: as...@live.com
CC: user@spark.apache.org
To really be correct, I
Hi Guys,
Are scala lazy values instantiated once per executor, or once per partition?
For example, if I have:
object Something {
  lazy val context = create()
  def foo(item: Item) = context.doSomething(item)
}
and I do
someRdd.foreach(Something.foo)
then will context get instantiated once per
...@gmail.com
Date: Fri, 19 Dec 2014 12:52:23 +0100
Subject: Re: Scala Lazy values and partitions
To: as...@live.com
CC: user@spark.apache.org
It will be instantiated once per VM, which translates to once per executor.
-kr, Gerard.
On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote:
Hi
Hello,
Say, I have the following code:
val something = Something()
someRdd.foreachRDD(something.someMethod)
And in something, I have a lazy member variable that gets created in
something.someMethod.
Would that lazy be created once per node, or once per partition?
Thanks,
Ashic.
is killed
(when the sparkContext is closed).
TD
On Fri, Dec 12, 2014 at 11:51 PM, Ashic Mahtab as...@live.com wrote:
Looks like the way to go.
Quick question regarding the connection pool approach - if I have a
connection that gets lazily instantiated, will it automatically die if I
kill
11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote:
That makes sense. I'll try that.
Thanks :)
From: tathagata.das1...@gmail.com
Date: Thu, 11 Dec 2014 04:53:01 -0800
Subject: Re: Session for connections?
To: as...@live.com
CC: user@spark.apache.org
You could create
Hi,
I was wondering if there's any way of having long running session type
behaviour in spark. For example, let's say we're using Spark Streaming to
listen to a stream of events. Upon receiving an event, we process it, and if
certain conditions are met, we wish to send a message to rabbitmq.
to shut them down. You
could have a usage timeout - shutdown connection after not being used
for 10 x batch interval.
TD
On Thu, Dec 11, 2014 at 4:28 AM, Ashic Mahtab as...@live.com wrote:
Hi,
I was wondering if there's any way of having long running session type
behaviour in spark
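The lazily created, per-executor connection TD describes is usually written something like this (the connection type here is a stand-in, entirely hypothetical; pair it with the usage-timeout shutdown he suggests):
object Rabbit {
  // stand-in for a real RabbitMQ connection type
  final class Connection { def publish(msg: String): Unit = println(msg) }
  lazy val connection = new Connection  // initialized once per executor JVM
  def send(msg: String): Unit = connection.publish(msg)
}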
On Dec 6, 2014, at 8:37 AM, Ashic Mahtab as...@live.com wrote:
Hi,
It appears
that spark is always attempting to use the driver's hostname to connect /
broadcast. This is usually fine, except when the cluster doesn't have DNS
configured. For example, in a vagrant cluster with a private network
Hi,
It appears that spark is always attempting to use the driver's hostname to
connect / broadcast. This is usually fine, except when the cluster doesn't have
DNS configured. For example, in a vagrant cluster with a private network. The
workers and masters, and the host (where the driver runs
+  </dependency>
</dependencies>
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
You can use the following command:
mvn -pl core,streaming package -DskipTests
Cheers
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Seems adding
PM, Ashic Mahtab as...@live.com wrote:
Update:
It seems the following combo causes things in spark streaming to go missing:
spark-core 1.1.0, spark-streaming 1.1.0, spark-cassandra-connector 1.1.0
The moment I add the three together, things like StreamingContext and Seconds
are unavailable. sbt
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Seconds is not part of org.apache.spark.streaming and object Receiver is not a
member of package org.apache.spark.streaming.receiver. If I
I've done this:
1. foreachPartition
2. Open connection.
3. foreach inside the partition.
4. close the connection.
Slightly crufty, but works. Would love to see a better approach.
Regards,
Ashic.
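In code, the above is roughly (connection API hypothetical):
rdd.foreachPartition { records =>
  val conn = createConnection()           // 2. open once per partition
  try records.foreach(r => conn.send(r))  // 3. foreach inside the partition
  finally conn.close()                    // 4. close once per partition
}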
Date: Fri, 5 Dec 2014 12:32:24 -0500
Subject: Spark Streaming Reusing JDBC Connections
From:
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Seconds is not part of org.apache.spark.streaming and object Receiver
the following command:
mvn -pl core,streaming package -DskipTests
Cheers
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra
Connector version 1.1.0 final fixed the guava back-compat issue. Maybe taking
out the guava exclusions might help?
Date: Mon, 1 Dec 2014 10:48:25 +0100
Subject: Kryo exception for CassandraSQLRow
From: shahab.mok...@gmail.com
Hi,
I'm looking to do an iterative algorithm implementation with data coming in
from Cassandra. This might be a use case for GraphX, however the ids are
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple
hubs and authorities HITS implementation, and the current
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly to create an uber jar to submit to the stand alone master. I'm
using the hadoop 1 prebuilt binaries for Spark. As soon as I try to do
sc.cassandraTable(...) I get an error that's likely to be a Guava versioning
://github.com/datastax/spark-cassandra-connector/issues/292
best,/Shahab
On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote:
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly to create an uber jar to submit to the stand alone master. I'm
using
Spark can do Map Reduce and more, and faster.
One area where using MR would make sense is if you're using something (maybe
like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible
now...just pulled that name out of thin air!).
You *can* use Spark from Java, but you'd have a
Hi Jerome,
I've been trying to get this working as well...
Where are you specifying cassandra parameters (i.e. seed nodes, consistency
levels, etc.)?
-Ashic.
Date: Thu, 20 Nov 2014 10:34:58 -0700
From: jer...@gmail.com
To: u...@spark.incubator.apache.org
Subject: Re: tableau spark sql
Hi Ben,
I haven't tried it with Python, but the instructions are the same as for
Scala compiled (jar) apps. What it's saying is that it's not possible to
offload the entire work to the master (ala hadoop) in a fire and forget (or
rather submit-and-forget) manner when running on stand alone.
into the inability
to share the SparkContext feature and it took a lot of finagling to
make things work (but it never felt production ready).
Ognen
On Sat, Nov 15, 2014 at 03:36:43PM +, Ashic Mahtab wrote:
Hi Ben,
I haven't tried it with Python, but the instructions are the same as
for Scala
jar not found :(
Seems if I create a directory symlink so that the share path is the same on
the unix mount point as in windows, and submit from the drive where the mount
point is, then it works. Granted, that's quite an ugly hack.
Reverting to serving jar off http (i.e. using a relative
Hi,
I'm trying to submit a spark application from a network share to the spark master.
Network shares are configured so that the master and all nodes have access to
the target jar at (say):
\\shares\publish\Spark\app1\someJar.jar
And this is mounted on each linux box (i.e. master and workers) at:
Hello,
I'm hoping to understand exactly what happens when a spark compiled app is
submitted to a spark stand-alone cluster master. Say, our master is A, and
workers are W1 and W2. Client machine C is submitting an app to the master
using spark-submit. Here's what I think happens:
* C submits
Hello,
I'm trying to find the best way of redeploying a spark streaming
application. Ideally, I was thinking of a scenario where a build server
packages up a jar and a deployment step submits it to a Spark Master. On the
next successful build, the next version would get deployed taking down the
Hi,
The docs specify that we can control the amount of ram / cores available via:
-c CORES, --cores CORES
  Total CPU cores to allow Spark applications to use on the machine
  (default: all available); only on worker
-m MEM, --memory MEM
  Total amount of memory to allow Spark applications to use on the
version
However, workers should be able to re-register since 1.2, since this patch
https://github.com/apache/spark/pull/2828 was merged
Best,
-- Nan Zhu
On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote
Hi Andrejs,
The calculations are a bit different to what I've come across in
Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available
here: http://www.mmds.org/
Their calculation of IDF is as follows:
IDF_i = log2(N / n_i)
where N is the number of documents and n_i is the number
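A quick sanity check of that definition in code (my sketch; this is the book's formula as quoted, not necessarily what any given library implements):
def idf(numDocs: Long, docsWithTerm: Long): Double =
  math.log(numDocs.toDouble / docsWithTerm) / math.log(2.0)
// idf(8, 2) == 2.0, i.e. log2(8 / 2)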
I'm quite interested in this as well. I remember something about a streaming
context needing one core. If that's the case, then won't 10 apps require 10
cores? Seems like a waste unless each topic is quite resource hungry? Would
love to hear from the experts :)
Date: Mon, 27 Oct 2014 06:35:29