I'm relatively new to Spark and have got a couple of questions:
I've got an IntelliJ SBT project that's using Spark Streaming with a
custom RabbitMQ receiver in the same project. When I run it against
local[2], all's well. When I put in spark://masterip:7077, I get a
ClassNotFoundException
Hi,
Is there a simple way to run spark sql queries against Sql Server databases? Or
are we limited to running sql and doing sc.parallelize()? Being able to query
small amounts of lookup info directly from spark can save a bunch of annoying
etl, and I'd expect Spark Sql to have some way of doing
Instead of using Spark SQL, you can use JdbcRDD to extract data from
SQL server. Currently Spark SQL can't run queries against SQL
server. The foreign data source API planned in Spark 1.2 can make
this possible.
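The JdbcRDD approach mentioned above might look like the following sketch. The connection string, table, and bounds are illustrative assumptions; JdbcRDD does require the query to contain two '?' placeholders, which it fills with per-partition bounds:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.JdbcRDD

// Hypothetical lookup table on SQL Server; adjust url/query for your setup.
def lookupRdd(sc: SparkContext): JdbcRDD[(Int, String)] = {
  val url = "jdbc:sqlserver://server;databaseName=db;user=u;password=p"
  new JdbcRDD(
    sc,
    () => DriverManager.getConnection(url),
    // The two '?' placeholders are mandatory; JdbcRDD substitutes partition bounds.
    "SELECT id, name FROM lookup WHERE id >= ? AND id <= ?",
    1,    // lower bound
    1000, // upper bound
    4,    // number of partitions
    (rs: ResultSet) => (rs.getInt("id"), rs.getString("name"))
  )
}
```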
On 10/21/14 6:26 PM, Ashic Mahtab
wrote
I'm no expert, but looked into how the python bits work a while back (was
trying to assess what it would take to add F# support). It seems python hosts a
jvm inside of it, and talks to scala spark in that jvm. The python server bit
translates the python calls to those in the jvm. The python
I'm looking to use spark for some ETL, which will mostly consist of update
statements (a column is a set, that'll be appended to, so a simple insert is
likely not going to work). As such, it seems like issuing CQL queries to import
the data is the best option. Using the Spark Cassandra
and write
it back to Cassandra.
Kr, Gerard
On Oct 23, 2014 1:21 PM, Ashic Mahtab as...@live.com wrote:
I'm looking to use spark for some ETL, which will mostly consist of update
statements (a column is a set, that'll be appended to, so a simple insert is
likely not going to work.
On Thu, Oct 23, 2014 at 2:48 PM, Ashic Mahtab as...@live.com wrote:
Hi Gerard,
Thanks for the response. Here's the scenario:
The target cassandra schema looks like this:
create table foo (
    id text primary key,
    bar int,
    things set<text>
)
The source in question is a Sql Server source
I'm quite interested in this as well. I remember something about a streaming
context needing one core. If that's the case, then won't 10 apps require 10
cores? Seems like a waste unless each topic is quite resource hungry? Would
love to hear from the experts :)
Date: Mon, 27 Oct 2014 06:35:29
Hi Andrejs,
The calculations are a bit different to what I've come across in
Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge University Press),
available here: http://www.mmds.org/
Their calculation of IDF is as follows:
IDF_i = log2(N / n_i)
where N is the number of documents and n_i is the number
version
However, workers should be able to re-register since 1.2, since this patch
https://github.com/apache/spark/pull/2828 was merged
Best,
-- Nan Zhu
On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote
Hi,
The docs specify that we can control the amount of ram / cores available via:
-c CORES, --cores CORES
    Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
-m MEM, --memory MEM
    Total amount of memory to allow Spark applications to use on the
Hello,
I'm trying to find the best way of redeploying a spark streaming
application. Ideally, I was thinking of a scenario where a build server
packages up a jar and a deployment step submits it to a Spark Master. On the
next successful build, the next version would get deployed taking down the
Hello,
I'm hoping to understand exactly what happens when a spark compiled app is
submitted to a spark stand-alone cluster master. Say, our master is A, and
workers are W1 and W2. Client machine C is submitting an app to the master
using spark-submit. Here's what I think happens:
* C submits
Hi,
I'm trying to submit a spark application from a network share to the spark master.
Network shares are configured so that the master and all nodes have access to
the target jar at (say):
\\shares\publish\Spark\app1\someJar.jar
And this is mounted on each linux box (i.e. master and workers) at:
jar not found :(
Seems if I create a directory symlink so that the share path is the same on
the unix mount point as in windows, and submit from the drive where the mount
point is, then it works. Granted, that's quite an ugly hack.
Reverting to serving jar off http (i.e. using a relative
Hi Ben,
I haven't tried it with Python, but the instructions are the same as for
Scala compiled (jar) apps. What it's saying is that it's not possible to
offload the entire work to the master (ala hadoop) in a fire and forget (or
rather submit-and-forget) manner when running on stand alone.
into the inability
to share the SparkContext feature and it took a lot of finagling to
make things work (but it never felt production ready).
Ognen
On Sat, Nov 15, 2014 at 03:36:43PM +, Ashic Mahtab wrote:
Hi Ben,
I haven't tried it with Python, but the instructions are the same as
for Scala
Hi Jerome,
I've been trying to get this working as well...
Where are you specifying cassandra parameters (i.e. seed nodes, consistency
levels, etc.)?
-Ashic.
Date: Thu, 20 Nov 2014 10:34:58 -0700
From: jer...@gmail.com
To: u...@spark.incubator.apache.org
Subject: Re: tableau spark sql
Spark can do Map Reduce and more, and faster.
One area where using MR would make sense is if you're using something (maybe
like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible
now...just pulled that name out of thin air!).
You *can* use Spark from Java, but you'd have a
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly to create an uber jar to submit to the stand alone master. I'm
using the hadoop 1 prebuilt binaries for Spark. As soon as I try to do
sc.cassandraTable(...) I get an error that's likely to be a Guava versioning
https://github.com/datastax/spark-cassandra-connector/issues/292
best,
/Shahab
On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote:
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using
sbt-assembly to create an uber jar to submit to the stand alone master. I'm
using
Hi,
I'm looking to do an iterative algorithm implementation with data coming in
from Cassandra. This might be a use case for GraphX, however the ids are
non-integral, and I would like to avoid a mapping (for now). I'm doing a simple
hubs and authorities HITS implementation, and the current
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra
Connector version 1.1.0 final fixed the guava back-compat issue. Maybe taking
out the guava exclusions might help?
Date: Mon, 1 Dec 2014 10:48:25 +0100
Subject: Kryo exception for CassandraSQLRow
From: shahab.mok...@gmail.com
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Seconds is not part of org.apache.spark.streaming and object Receiver is not a
member of package org.apache.spark.streaming.receiver. If I
I've done this:
1. foreachPartition
2. Open connection.
3. foreach inside the partition.
4. close the connection.
Slightly crufty, but works. Would love to see a better approach.
Regards,
Ashic.
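The four steps above can be sketched as follows; `createConnection` and `write` are hypothetical stand-ins for whatever client library is in use:

```scala
// One connection per partition: open, write each record, then close.
rdd.foreachPartition { partition =>
  val connection = createConnection() // step 2: open once per partition
  try {
    partition.foreach(record => write(connection, record)) // step 3
  } finally {
    connection.close() // step 4: closed even if a write fails
  }
}
```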
Date: Fri, 5 Dec 2014 12:32:24 -0500
Subject: Spark Streaming Reusing JDBC Connections
From:
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Seconds is not part of org.apache.spark.streaming and object Receiver
the following command: mvn -pl core,streaming package -DskipTests
Cheers
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Seems adding the cassandra connector and spark streaming causes issues. I've
added my build and code file. Running sbt compile gives weird errors like
Hi,
It appears that spark is always attempting to use the driver's hostname to
connect / broadcast. This is usually fine, except when the cluster doesn't have
DNS configured. For example, in a vagrant cluster with a private network. The
workers and masters, and the host (where the driver runs
</dependency>
</dependencies>
<build>
  <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
You can use the following command: mvn -pl core,streaming package -DskipTests
Cheers
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Seems adding
PM, Ashic Mahtab as...@live.com wrote:
Update:
It seems the following combo causes things in spark streaming to go missing:
spark-core 1.1.0, spark-streaming 1.1.0, spark-cassandra-connector 1.1.0
The moment I add the three together, things like StreamingContext and Seconds
are unavailable. sbt
On Dec 6, 2014, at 8:37 AM, Ashic Mahtab as...@live.com wrote:Hi,It appears
that spark is always attempting to use the driver's hostname to connect /
broadcast. This is usually fine, except when the cluster doesn't have DNS
configured. For example, in a vagrant cluster with a private network
Hi,
I was wondering if there's any way of having long running session type
behaviour in spark. For example, let's say we're using Spark Streaming to
listen to a stream of events. Upon receiving an event, we process it, and if
certain conditions are met, we wish to send a message to rabbitmq.
to shut them down. You
could have a usage timeout - shutdown connection after not being used
for 10 x batch interval.
TD
On Thu, Dec 11, 2014 at 4:28 AM, Ashic Mahtab as...@live.com wrote:
Hi,
I was wondering if there's any way of having long running session type
behaviour in spark
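TD's usage-timeout suggestion might look like the sketch below. This is a hedged illustration, not an established API: `Connection`, `connect()`, and the batch-interval value are all assumptions.

```scala
// Per-executor lazily created connection, closed after sitting idle for
// roughly 10x the batch interval (per the suggestion above).
object ConnectionHolder {
  private val idleTimeoutMs = 10 * 2000L // assuming a 2-second batch interval
  private var conn: Option[Connection] = None
  private var lastUsed = System.currentTimeMillis()

  def withConnection[A](f: Connection => A): A = synchronized {
    val c = conn.getOrElse { val created = connect(); conn = Some(created); created }
    lastUsed = System.currentTimeMillis()
    f(c)
  }

  // Call periodically (e.g. from a scheduled task) to enforce the idle timeout.
  def reapIfIdle(): Unit = synchronized {
    if (conn.isDefined && System.currentTimeMillis() - lastUsed > idleTimeoutMs) {
      conn.foreach(_.close())
      conn = None
    }
  }
}
```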
11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote:
That makes sense. I'll try that.
Thanks :)
From: tathagata.das1...@gmail.com
Date: Thu, 11 Dec 2014 04:53:01 -0800
Subject: Re: Session for connections?
To: as...@live.com
CC: user@spark.apache.org
You could create
is killed
(when the sparkContext is closed).
TD
On Fri, Dec 12, 2014 at 11:51 PM, Ashic Mahtab as...@live.com wrote:
Looks like the way to go.
Quick question regarding the connection pool approach - if I have a
connection that gets lazily instantiated, will it automatically die if I
kill
Hello,
Say, I have the following code:
let something = Something()
someRdd.foreachRdd(something.someMethod)
And in something, I have a lazy member variable that gets created in
something.someMethod.
Would that lazy be created once per node, or once per partition?
Thanks,
Ashic.
Hi,
Say we have an operation that writes something to an external resource and
gets some output. For example:
def doSomething(entry: SomeEntry, session: Session): SomeOutput = {
  val result = session.SomeOp(entry)
  SomeOutput(entry.Key, result.SomeProp)
}
I could use a transformation for
Thanks Sean. That's kind of what I figured. Luckily, for my use case writes are
idempotent, so map works.
From: so...@cloudera.com
Date: Fri, 19 Dec 2014 11:06:51 +
Subject: Re: How to run an action and get output?
To: as...@live.com
CC: user@spark.apache.org
To really be correct, I
Hi Guys,
Are scala lazy values instantiated once per executor, or once per partition?
For example, if I have:
object Something {
  lazy val context = create()
  def foo(item: Any) = context.doSomething(item)
}
and I do
someRdd.foreach(Something.foo)
then will context get instantiated once per
...@gmail.com
Date: Fri, 19 Dec 2014 12:52:23 +0100
Subject: Re: Scala Lazy values and partitions
To: as...@live.com
CC: user@spark.apache.org
It will be instantiated once per VM, which translates to once per executor.
-kr, Gerard.
On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote:
Hi
Hi,
Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate
4 cores to a streaming application. I can do this via spark submit by:
spark-submit --total-executor-cores 4
However, this assigns one core per machine. I would like to use 2 cores on 2
machines instead,
To: as...@live.com
CC: user@spark.apache.org
I think you want:
--num-executors 2 --executor-cores 2
On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote:
Hi,
Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to
dedicate 4 cores to a streaming application. I
-executors` is not available for standalone clusters. In
standalone mode, you must start new workers on your node as it is a
1:1 ratio of workers to executors.
On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote:
Hi Sean,
Thanks for the response.
It seems --num-executors
setup, number of receivers, whats the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some docs that would tell me if these are expected
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some docs that would tell me if these are expected
or not.
From: as...@live.com
To: user@spark.apache.org
Subject: Are these numbers abnormal for spark streaming?
Date: Wed, 21 Jan 2015
delay!! Can you tell me more about
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump? Would really appreciate input from others
using Streaming. Or at least some
Hello,
I'm trying to kick off a spark streaming job to a stand alone master using
spark submit inside of init.d. This is what I have:
DAEMON="spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar"
start() {
$DAEMON -p
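A hedged sketch of where this init.d script seems to be heading: run spark-submit in the background and record the PID so a stop() can kill it later. The log and pid paths are assumptions.

```shell
DAEMON="spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar"

start() {
    # Background the submit; a foreground spark-submit would block init.d.
    nohup $DAEMON > /var/log/streamer.log 2>&1 &
    echo $! > /var/run/streamer.pid
}
```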
You could look at using Cassandra for storage. Spark integrates nicely with
Cassandra, and a combination of Spark + Cassandra would give you fast access to
structured data in Cassandra, while enabling analytic scenarios via Spark.
Cassandra would take care of the replication, as it's one of the
Hi,
For client mode spark submits of applications, we can do the following:
def createStreamingContext() = {
...
val sc = new SparkContext(conf)
// Create a StreamingContext with a 1 second batch size
val ssc = new StreamingContext(sc, Seconds(1))
}
...
val ssc =
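The truncated snippet above appears to be building toward the standard checkpoint-recovery pattern, where the factory function is passed to `StreamingContext.getOrCreate` (a real API) so a restarted driver recovers from the checkpoint. The checkpoint path and app name here are assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

def createStreamingContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("app")
  // Create a StreamingContext with a 1 second batch size
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/path/to/checkpoints") // hypothetical checkpoint dir
  ssc
}

// Recreates from checkpoint if one exists, else calls the factory.
val ssc = StreamingContext.getOrCreate("/path/to/checkpoints", createStreamingContext _)
ssc.start()
ssc.awaitTermination()
```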
in 4.961 secs (median) to 106 msgs in 4.761
seconds. I think there's evidence that setup costs are quite high in this case
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com
wrote:
Hi Ashic Mahtab,
The Cassandra
To: as...@live.com
CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com
Hi Ashic Mahtab,
The Cassandra and the Zookeeper are they installed as a part of Yarn
architecture or are they installed in a separate layer with Apache Spark .
Thanks and Regards,
Sudipta
On Thu, Jan 22
tathagata.das1...@gmail.com
wrote:
This is not normal. Its a huge scheduling delay!! Can you tell me more about
the application? - cluster setup, number of receivers, what's the computation, etc.
On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote:
Hate to do this...but...erm...bump
) to 106 msgs in 4.761
seconds. I think there's evidence that setup costs are quite high in this case
and increasing the batch interval is helping.
On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com
wrote:
Hi Ashic Mahtab,
The Cassandra and the Zookeeper are they installed
Hi Guys,
I've got Spark Streaming set up for a low data rate system (using spark's
features for analysis, rather than high throughput). Messages are coming in
throughout the day, at around 1-20 per second (finger in the air estimate...not
analysed yet). In the spark streaming UI for the
Hi,
I've seen a few articles where they use CqlStorageHandler to create hive tables
referencing Cassandra data using the thriftserver. Is there a secret to getting
this to work? I've basically got Spark built with Hive, and a Cassandra
cluster. Is there a way to get the hive server to talk to
Two questions regarding worker cleanup:
1) Is the best place to enable worker cleanup setting
export SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.interval=30 in conf/spark-env.sh for each worker? Or is
there a better place?
2) I see this has a default TTL of 7
the main script on sleep for some time (say 2 minutes).
Thanks
Best Regards
On Sat, Jan 24, 2015 at 1:57 AM, Ashic Mahtab as...@live.com wrote:
Hello,
I'm trying to kick off a spark streaming job to a stand alone master using
spark submit inside of init.d. This is what I have:
DAEMON=spark
Is there an easy way to check if a spark binary release was built with Hive
support? Are any of the prebuilt binaries on the spark website built with hive
support?
Thanks,
Ashic.
are built with -Phive except the 'without-hive' build.
On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab as...@live.com wrote:
Is there an easy way to check if a spark binary release was built with Hive
support? Are any of the prebuilt binaries on the spark website built with
hive support
Hi Francisco,
While I haven't tried this, have a look at the contents of
start-thriftserver.sh - all it's doing is setting up a few variables and
calling:
/bin/spark-submit --class
org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
and passing some additional parameters. Perhaps doing the
Hello,
Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've
noticed the following:
* On 1.4, sc.textFile("D:\\folder\\").collect() fails from both spark-shell.cmd
and when running a scala application referencing the spark-core package from
maven.
?!?
From: guha.a...@gmail.com
To: as...@live.com
CC: user@spark.apache.org
It's a problem since 1.3 I think
On 26 Jun 2015 04:00, Ashic Mahtab as...@live.com wrote:
Hello,
Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've
noticed the following:
* On 1.4, sc.textFile("D:\\folder
Date: Fri, 26 Jun 2015 08:54:31 +
On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com wrote:
Thanks for the replies, guys.
Is this a permanent change as of 1.3, or will it go away at some point?
Don't blame the spark team, complain to the hadoop team for being slow
Hi Ayan,
How continuous is your workload? As Akhil points out, with streaming,
you'll give up at least one core for receiving, will need at most one more core
for processing. Unless you're running on something like Mesos, this means that
those cores are dedicated to your app, and can't be
and thats where my concern is.
TIA
Ayan
On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab as...@live.com wrote:
Hi Ayan,
How continuous is your workload? As Akhil points out, with streaming,
you'll give up at least one core for receiving, will need at most one more core
for processing. Unless you're
Unfortunately, afaik that project is long dead.
It'd be an interesting project to create an intermediary protocol, perhaps
using something that nearly everything these days understand (unfortunately [!]
that might be JavaScript). For example, instead of pickling language
constructs, it might be
Spark comes with quite a few components. At its core is... surprise... spark
core. This provides the core things required to run spark jobs. Spark provides
a lot of operators out of the box... take a look at
When you say tasks, do you mean different applications, or different tasks in
the same application? If it's the same program, they should be able to share
the broadcasted value. But given you're asking the question, I imagine they're
separate.
And in that case, afaik, the answer is no. You
Hi Ewan,
Transforms are definitions of what needs to be done - they don't
execute until an action is triggered. For what you want, I think you might
need to have an action that writes out rdds to some sort of buffered writer.
-Ashic.
From: ewan.le...@realitymine.com
To: user@spark.apache.org
Hello,
I'm trying to work offline with spark-core. I've got an empty project
with the following:
name := "sbtSand"
version := "1.0"
scalaVersion := "2.11.7"
libraryDependencies ++= Seq(
"joda-time" % "joda-time" % "2.9.1",
"org.apache.spark" %% "spark-core" % "1.5.2"
)
I can "sbt
To answer my own question, it appears certain things (like parents, etc.)
caused the issue. I was using sbt 0.13.8. Using 0.13.9 works fine.
From: as...@live.com
To: user@spark.apache.org
Subject: Working offline with spark-core and sbt
Date: Thu, 31 Dec 2015 02:07:26 +
Hello,I'm trying
Hello,
Is it possible to provide a log4j properties file when submitting jobs to
a cluster? I know that by default spark looks for a log4j.properties file in
the conf directory. I'm looking for a way to specify a different
log4j.properties file (external to the application) without pointing to a
Found it.
You can pass in the jvm parameter log4j.configuration. The following works:
-Dlog4j.configuration=file:path/to/log4j.properties
It doesn't work without the file: prefix though. Tested in 1.6.0.
Cheers,
Ashic.
From: as...@live.com
To: user@spark.apache.org
Subject: Specify log4j
src/main/resources/log4j.properties
Subject: Re: log4j pains
From: st...@memeticlabs.org
Date: Thu, 10 Mar 2016 11:08:46 -0600
CC: user@spark.apache.org
To: as...@live.com
Where in the jar is the log4j.properties file?
On Mar 10, 2016, at 9:40 AM, Ashic Mahtab <as...@live.com> wrote:1. F
Hello,
I'm trying to use a custom log4j appender, with things specified in a
log4j.properties file. Very little seems to work in this regard. Here's what
I've tried:
1. Fat jar with logging dependencies included. log4j.properties in fat jar.
Spark doesn't pick up the properties file, so uses its
Spark may not be the right tool for this. Working on just the mobile device,
you won't be scaling out stuff, and as such most of the benefits of Spark would
be nullified. Moreover, it'd likely run slower than things that are meant to
work in a single process. Spark is also quite large, which is
@spark.apache.org
Hi Ashic,
Unfortunately I don't know how to work around that - I suggested this line as
it looked promising (I had considered it once before deciding to use a
different algorithm) but I never actually tried it.
Regards,
James
On 13 April 2016 at 02:29, Ashic Mahtab <as...@live.
to write code in the
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James
On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote:
Hello,I'm trying to save a pipeline with a random forest classifier. If
will need to write code in the
org.apache.spark.ml package.
I've not actually tried doing this myself but it looks as if it might work.
Regards,
James
On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote:
Hello,I'm trying to save a pipeline with a random forest class
Hello,
I'm trying to save a pipeline with a random forest classifier. If I try
to save the pipeline, it complains that the classifier is not Writable, and
indeed the classifier itself doesn't have a write function. There's a pull
request that's been merged that enables this for Spark 2.0 (any
Hello,
Is it possible to specify additional class paths / java options "in
addition to" those specified in spark-defaults.conf? I see that if I specify
spark.executor.extraJavaOptions or spark.executor.extraClassPath in defaults,
and then specify --conf
I spent ages on this recently, and here's what I found:
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties"
works. Alternatively, you can also do:
--conf
"spark.executor.extraJavaOptions=-Dlog4j.configuration=filename.properties"
Spark will execute as a client for hdfs. In other words, it'll contact the
hadoop master for the hdfs cluster, which will return the block info, and then
the data will be fetched from the data nodes.
Date: Tue, 19 Apr 2016 14:00:31 +0530
Subject: Spark + HDFS
From: chaturvedich...@gmail.com
To:
Hi,
Is there a way to simulate "networked" spark when running local (i.e.
master=local[4])? Ideally, some setting that'll ensure any "Task not
serializable" errors are caught during local testing? I seem to vaguely
remember something, but am having trouble pinpointing it.
Cheers,
Ashic.
) and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()
And then join..
On Aug 8, 2016, at 8:17 PM,
On 9 August 2016 at 15:46, Ashic Mahtab <as...@live.com> wrote:
Hi Sam,
Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's
no progress. The spark UI doesn't even show up
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2478
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Wed, Aug 10, 2016 at 10:24 AM, Ashic
Guess the good people in the Cassandra world are stuck in the past making
indexes, materialized views, etc. better with every release :)
From: mich.talebza...@gmail.com
Date: Mon, 15 Aug 2016 11:11:03 +0100
Subject: Re: Does Spark SQL support indexes?
To: gourav.sengu...@gmail.com
CC:
.count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.parquet('path_to_table_b').repartition('id').cache()
b.count()
And then join..
On Aug 8, 2016, at 8:1
memory on the driver, increase your memory.
Speaking of which, a filtering step might also help on the above, i.e., filter
the bigRDD with the keys of the Map before joining.
Hope this helps,
Anastasios
On Tue, Aug 9, 2016 at 4:46 PM, Ashic Mahtab <as...@live.com> wrote:
Hi Sam,Yup
Grüßen / Sincères salutations
M. Lohith Samaga
From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04, 2016 15.06
To: Apache Spark
Subject: RE: Cluster mode deployment from jar in S3
Sorry to do this...but... *bump*
From:
as...@live.com
To: user@spark.apache.org
Hello,
We have two parquet inputs of the following form:
a: id: String, Name: String (1.5TB)
b: id: String, Number: Int (1.3GB)
We need to join these two to get (id, Number, Name). We've tried two approaches:
a.join(b, Seq("id"), "right_outer")
where a and b are dataframes. We also tried taking the
g 2016 00:01:32 +0530
Subject: Re: Spark join and large temp files
To: as...@live.com
CC: user@spark.apache.org
Register your dataframes as temp tables and then try the join on the temp
table. This should resolve your issue.
Thanks
Deepak
On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <as...@
like
what helped in this scenario.
ThanksDeepak
On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab <as...@live.com> wrote:
Hi Deepak,
Thanks for the response.
Registering the temp tables didn't help. Here's what I have:
val a = sqlContext.read.parquet(...).select("eid.
; user@spark.apache.org
Have you tried to broadcast your small table in order to perform your
join?
joined = bigDF.join(broadcast(smallDF), "id")
On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab <as...@live.com> wrote:
Hi Deepak,No...not really. Upping the disk size is a solution, bu
the results. It should
not take more than 40 mins in a 32 GB RAM system with 6 core processors.
Gourav
On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab <as...@live.com> wrote:
Hi Mich,
Hardware: AWS EMR cluster with 15 nodes of r3.2xlarge (CPU, RAM
fine, disk a couple of hundred gig).
W
Hello,
I've got a Spark stand-alone cluster using EC2 instances. I can submit
jobs using "--deploy-mode client", however using "--deploy-mode cluster" is
proving to be a challenge. I've tried this:
spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
access key aid and secret access key when you
initially configured it.
Is your s3 bucket without any access restrictions?
Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga
From: Ashic Mahtab [mailto:as...@live.com]
Sent: Monday, July 04