Executors idle, driver heap exploding and maxing only 1 cpu core

2019-05-23 Thread Ashic Mahtab
Hi, We have a quite long winded Spark application we inherited with many stages. When we run on our spark cluster, things start off well enough. Workers are busy, lots of progress made, etc. etc. However, 30 minutes into processing, we see CPU usage of the workers drop drastically. At this

Re: Unable to broadcast a very large variable

2019-04-10 Thread Ashic Mahtab
Default is 10mb. Depends on memory available, and what the network transfer effects are going to be. You can specify spark.sql.autoBroadcastJoinThreshold to increase the threshold in case of spark sql. But you definitely shouldn't be broadcasting gigabytes.
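As a rough sketch (assuming a Spark 2.x SparkSession; the 100 MB figure is only an example, not a recommendation), the threshold can be adjusted or disabled like this:
    // Raise the auto-broadcast threshold for Spark SQL joins (value in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", (100L * 1024 * 1024).toString)
    // Or disable automatic broadcast joins entirely:
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")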

spark and STS tokens (Federation Tokens)

2018-09-26 Thread Ashic Mahtab
Hi, I'm looking to have spark jobs access S3 with temporary credentials. I've seen some examples around AssumeRole, but I have a scenario where the temp credentials are provided by GetFederationToken. Is there anything that can help, or do I need to use boto to execute GetFederationToken, and
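A possible sketch, assuming hadoop-aws/s3a is on the classpath and that the temporary credentials (tempAccessKey, tempSecretKey and tempSessionToken are placeholders) have already been obtained outside Spark, e.g. via GetFederationToken:
    val hc = spark.sparkContext.hadoopConfiguration
    // Tell s3a to use short-lived credentials that include a session token.
    hc.set("fs.s3a.aws.credentials.provider",
           "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    hc.set("fs.s3a.access.key", tempAccessKey)
    hc.set("fs.s3a.secret.key", tempSecretKey)
    hc.set("fs.s3a.session.token", tempSessionToken)
    val df = spark.read.parquet("s3a://some-bucket/some/path")   // bucket/path are placeholders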

Easily creating custom encoders

2017-03-21 Thread Ashic Mahtab
I'm trying to easily create custom encoders for case classes having "unfriendly" fields. I could just kryo the whole thing, but would like to at least have a few fields in the schema instead of one binary blob. For example, case class MyClass(id: UUID, items: Map[String, Double], name: String)
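A minimal sketch of the trade-off described above (it does not implement a true custom encoder, only the two common workarounds):
    import org.apache.spark.sql.{Encoder, Encoders}
    case class MyClass(id: java.util.UUID, items: Map[String, Double], name: String)
    // Whole-object Kryo encoder: works, but the Dataset schema is a single binary column.
    implicit val myClassEnc: Encoder[MyClass] = Encoders.kryo[MyClass]
    // Common workaround: map to a "friendly" mirror class so the built-in product encoder
    // applies and the awkward field (UUID here) becomes an ordinary String column.
    case class MyClassRow(id: String, items: Map[String, Double], name: String)
    // ds.map(c => MyClassRow(c.id.toString, c.items, c.name))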

Re: How does predicate push down really help?

2016-11-16 Thread Ashic Mahtab
Consider a data source that has data in 500mb files, and doesn't support predicate push down. Spark will have to load all the data into memory before it can apply filtering, select "columns" etc. Each 500mb file will at some point have to be loaded entirely in memory. Now consider a data source
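For contrast, a sketch with a source that does support pushdown (Parquet): the filter is pushed into the scan, so row groups whose min/max statistics rule out the predicate can be skipped instead of being loaded and filtered in memory.
    import spark.implicits._
    val filtered = spark.read.parquet("/data/events")   // path is a placeholder
      .filter($"value" > 100)
      .select("id", "value")
    // filtered.explain(true)  // the physical plan typically lists PushedFilters: [GreaterThan(value,100)]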

Streaming - Stop when there's no more data

2016-11-15 Thread Ashic Mahtab
I'm using Spark Streaming to process a large number of files (10s of millions) from a single directory in S3. Using sparkContext.textFile or wholeTextFile takes ages and doesn't do anything. Pointing Structured Streaming to that location seems to work, but after processing all the input, it

RE: Does Spark SQL support indexes?

2016-08-15 Thread Ashic Mahtab
Guess the good people in the Cassandra world are stuck in the past making indexes, materialized views, etc. better with every release :) From: mich.talebza...@gmail.com Date: Mon, 15 Aug 2016 11:11:03 +0100 Subject: Re: Does Spark SQL support indexes? To: gourav.sengu...@gmail.com CC:

RE: Simulate serialization when running local

2016-08-15 Thread Ashic Mahtab
k/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2478 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Aug 10, 2016 at 10:24 AM, Ashic

RE: Spark join and large temp files

2016-08-12 Thread Ashic Mahtab
.count() to force a shuffle, it'll push the records that will be joined to the same executors. So: a = spark.read.parquet('path_to_table_a').repartition('id').cache(); a.count(); b = spark.read.parquet('path_to_table_b').repartition('id').cache(); b.count(). And then join.. On Aug 8, 2016, at 8:1

RE: Spark join and large temp files

2016-08-11 Thread Ashic Mahtab
) and .count() to force a shuffle, it'll push the records that will be joined to the same executors. So: a = spark.read.parquet('path_to_table_a').repartition('id').cache(); a.count(); b = spark.read.parquet('path_to_table_b').repartition('id').cache(); b.count(). And then join.. On Aug 8, 2016, at 8:17 PM,

Simulate serialization when running local

2016-08-10 Thread Ashic Mahtab
Hi,Is there a way to simulate "networked" spark when running local (i.e. master=local[4])? Ideally, some setting that'll ensure any "Task not serializable" errors are caught during local testing? I seem to vaguely remember something, but am having trouble pinpointing it. Cheers,Ashic.
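Two approaches sometimes suggested for this, sketched below with caveats: local-cluster mode is mainly used by Spark's own test suite, and assertSerializable is a hand-rolled helper, not a Spark API.
    // 1. "local-cluster" starts separate executor JVMs, so tasks and broadcasts really get serialized.
    //    Format: local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB]
    // spark-submit --master "local-cluster[2,1,1024]" ...
    // 2. Assert serializability of a suspect object/closure directly in a unit test.
    import java.io.{ByteArrayOutputStream, ObjectOutputStream}
    def assertSerializable(obj: AnyRef): Unit = {
      val oos = new ObjectOutputStream(new ByteArrayOutputStream())
      oos.writeObject(obj)   // throws NotSerializableException if obj isn't serializable
      oos.close()
    }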

RE: Spark join and large temp files

2016-08-10 Thread Ashic Mahtab
emory on the driver, increase your memory. Speaking of which, a filtering step might also help on the above, i.e., filter the bigRDD with the keys of the Map before joining. Hope this helps,Anastasios On Tue, Aug 9, 2016 at 4:46 PM, Ashic Mahtab <as...@live.com> wrote: Hi Sam,Yup

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
the results. It should not take more than 40 mins in a 32 GB RAM system with 6 core processors. Gourav On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab <as...@live.com> wrote: Hi Mich,Hardware: AWS EMR cluster with 15 nodes with Rx3.2xlarge (CPU, RAM fine, disk a couple of hundred gig). W

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
in no case be liable for any monetary damages arising from such loss, damage or destruction. On 9 August 2016 at 15:46, Ashic Mahtab <as...@live.com> wrote: Hi Sam,Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's no progress. The spark UI doesn't even show up

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
; user@spark.apache.org Have you tried to broadcast your small table table in order to perform your join ? joined = bigDF.join(broadcast(smallDF, ) On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab <as...@live.com> wrote: Hi Deepak,No...not really. Upping the disk size is a solution, bu

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
like what helped in this scenario. Thanks Deepak On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab <as...@live.com> wrote: Hi Deepak,Thanks for the response. Registering the temp tables didn't help. Here's what I have: val a = sqlContext.read.parquet(...).select("eid.

RE: Spark join and large temp files

2016-08-08 Thread Ashic Mahtab
g 2016 00:01:32 +0530 Subject: Re: Spark join and large temp files To: as...@live.com CC: user@spark.apache.org Register you dataframes as temp tables and then try the join on the temp table.This should resolve your issue. ThanksDeepak On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <as...@

Spark join and large temp files

2016-08-08 Thread Ashic Mahtab
Hello,We have two parquet inputs of the following form: a: id:String, Name:String (1.5TB)b: id:String, Number:Int (1.3GB) We need to join these two to get (id, Number, Name). We've tried two approaches: a.join(b, Seq("id"), "right_outer") where a and b are dataframes. We also tried taking the
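Since b (~1.3 GB) is small relative to a, a broadcast (map-side) join is one option discussed in this thread; the repartition-by-key approach quoted above is the other. A sketch, assuming the 2.0-era DataFrame API and that b comfortably fits in executor and driver memory:
    import org.apache.spark.sql.functions.broadcast
    val a = spark.read.parquet("path_to_a")   // ~1.5 TB: id, Name
    val b = spark.read.parquet("path_to_b")   // ~1.3 GB: id, Number
    // Broadcast the small side so the large side is never shuffled to disk.
    val joined = a.join(broadcast(b), Seq("id"), "right_outer")
    // Alternative from this thread: co-partition both sides on the join key first,
    // e.g. a.repartition(col("id")) and b.repartition(col("id")), then join.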

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab [mailto:as...@live.com] Sent: Monday, July 04, 2016 15.06 To: Apache Spark Subject: RE: Cluster mode deployment from jar in S3 Sorry to do this...but... *bump* From: as...@live.com To: user@spark.apache.org

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
access key aid and secret access key when you initially configured it. Is your s3 bucket without any access restrictions? Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab [mailto:as...@live.com] Sent: Monday, July 04

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Sorry to do this...but... *bump* From: as...@live.com To: user@spark.apache.org Subject: Cluster mode deployment from jar in S3 Date: Fri, 1 Jul 2016 17:45:12 +0100 Hello,I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs using "--deploy-mode client", however using

Cluster mode deployment from jar in S3

2016-07-01 Thread Ashic Mahtab
Hello, I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs using "--deploy-mode client", however using "--deploy-mode cluster" is proving to be a challenge. I've tried this: spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster
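A sketch of the kind of invocation involved; note that in standalone cluster mode one of the workers downloads the jar to launch the driver, so the workers themselves need the S3 filesystem classes (hadoop-aws plus the AWS SDK) and credentials available (for example via core-site.xml or environment variables on each worker), which is often the part that proves challenging:
    spark-submit \
      --class foo \
      --master spark://master-ip:7077 \
      --deploy-mode cluster \
      s3a://my-bucket/path/to/someJar.jar   # bucket and path are placeholders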

RE: Spark + HDFS

2016-04-19 Thread Ashic Mahtab
Spark will execute as a client for hdfs. In other words, it'll contact the hadoop master for the hdfs cluster, which will return the block info, and then the data will be fetched from the data nodes. Date: Tue, 19 Apr 2016 14:00:31 +0530 Subject: Spark + HDFS From: chaturvedich...@gmail.com To:

RE: Logging in executors

2016-04-18 Thread Ashic Mahtab
I spent ages on this recently, and here's what I found: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties" works. Alternatively, you can also do: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=filename.properties"
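A sketch of how the second (bare filename) variant is usually wired up: ship the file with the job via --files so it lands in each executor's working directory; the file names and application class below are placeholders:
    spark-submit \
      --files /local/path/log4j-executor.properties \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-executor.properties" \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/local/path/log4j-executor.properties" \
      --class Foo app.jar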

RE: ML Random Forest Classifier

2016-04-13 Thread Ashic Mahtab
@spark.apache.org Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab <as...@live.

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
ight work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest classifier. If I try to save the pipeline, it complains that the classifier is not Writable, and indeed the classifier itself doesn't have a

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
will need to write code in the org.apache.spark.ml package. I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest class

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
to write code in the org.apache.spark.ml package. I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest classifier. If

RE: ML Random Forest Classifier

2016-04-11 Thread Ashic Mahtab
rite code in the org.apache.spark.ml package. I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest classifier. If I try to save the pipelin

ML Random Forest Classifier

2016-04-11 Thread Ashic Mahtab
Hello,I'm trying to save a pipeline with a random forest classifier. If I try to save the pipeline, it complains that the classifier is not Writable, and indeed the classifier itself doesn't have a write function. There's a pull request that's been merged that enables this for Spark 2.0 (any
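For reference, a sketch of what this looks like once ML persistence is available (Spark 2.0+); trainingDf and the paths are placeholders:
    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.classification.RandomForestClassifier
    val rf = new RandomForestClassifier().setLabelCol("label").setFeaturesCol("features")
    val model = new Pipeline().setStages(Array(rf)).fit(trainingDf)
    model.write.overwrite().save("/models/rf-pipeline")      // fails pre-2.0: not Writable
    val reloaded = PipelineModel.load("/models/rf-pipeline")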

RE: Spark on Mobile platforms

2016-04-07 Thread Ashic Mahtab
Spark may not be the right tool for this. Working on just the mobile device, you won't be scaling out stuff, and as such most of the benefits of Spark would be nullified. Moreover, it'd likely run slower than things that are meant to work in a single process. Spark is also quite large, which is

Additional classpaths / java options

2016-03-22 Thread Ashic Mahtab
Hello,Is it possible to specify additional class paths / java options "in addition to" those specified in spark-defaults.conf? I see that if I specify spark.executor.extraJavaOptions or spark.executor.extraClassPaths in defaults, and then specify --conf

RE: log4j pains

2016-03-10 Thread Ashic Mahtab
src/main/resources/log4j.properties Subject: Re: log4j pains From: st...@memeticlabs.org Date: Thu, 10 Mar 2016 11:08:46 -0600 CC: user@spark.apache.org To: as...@live.com Where in the jar is the log4j.properties file? On Mar 10, 2016, at 9:40 AM, Ashic Mahtab <as...@live.com> wrote:1. F

log4j pains

2016-03-10 Thread Ashic Mahtab
Hello,I'm trying to use a custom log4j appender, with things specified in a log4j.properties file. Very little seems to work in this regard. Here's what I've tried: 1. Fat jar with logging dependencies included. log4j.properties in fat jar. Spark doesn't pick up the properties file, so uses its

RE: Specify log4j properties file

2016-03-09 Thread Ashic Mahtab
Found it. You can pass in the jvm parameter log4j.configuration. The following works: -Dlog4j.configuration=file:path/to/log4j.properties It doesn't work without the file: prefix though. Tested in 1.6.0. Cheers,Ashic. From: as...@live.com To: user@spark.apache.org Subject: Specify log4j

Specify log4j properties file

2016-03-09 Thread Ashic Mahtab
Hello,Is it possible to provide a log4j properties file when submitting jobs to a cluster? I know that by default spark looks for a log4j.properties file in the conf directory. I'm looking for a way to specify a different log4j.properties file (external to the application) without pointing to a

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ashic Mahtab
Hi Ewan,Transforms are definitions of what needs to be done - they don't execute until and action is triggered. For what you want, I think you might need to have an action that writes out rdds to some sort of buffered writer. -Ashic. From: ewan.le...@realitymine.com To: user@spark.apache.org
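A sketch of that suggestion: an output action per micro-batch, with the writer opened once per partition rather than once per record (dstream and openBufferedWriter are placeholders):
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val writer = openBufferedWriter()                  // hypothetical helper
        try records.foreach(r => writer.write(r.toString))
        finally writer.close()
      }
    }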

Working offline with spark-core and sbt

2015-12-30 Thread Ashic Mahtab
Hello,I'm trying to work offline with spark-core. I've got an empty project with the following: name := "sbtSand" version := "1.0" scalaVersion := "2.11.7" libraryDependencies ++= Seq( "joda-time" % "joda-time" % "2.9.1", "org.apache.spark" %% "spark-core" % "1.5.2" ) I can "sbt

RE: Working offline with spark-core and sbt

2015-12-30 Thread Ashic Mahtab
To answer my own question, it appears certain tihngs (like parents, etc.) caused the issue. I was using sbt 0.13.8. Using 0.13.9 works fine. From: as...@live.com To: user@spark.apache.org Subject: Working offline with spark-core and sbt Date: Thu, 31 Dec 2015 02:07:26 + Hello,I'm trying

RE: Spark Broadcasting large dataset

2015-07-10 Thread Ashic Mahtab
When you say tasks, do you mean different applications, or different tasks in the same application? If it's the same program, they should be able to share the broadcasted value. But given you're asking the question, I imagine they're separate. And in that case, afaik, the answer is no. You

RE: JDBC Streams

2015-07-05 Thread Ashic Mahtab
Hi Ayan,How continuous is your workload? As Akhil points out, with streaming, you'll give up at least one core for receiving, will need at most one more core for processing. Unless you're running on something like Mesos, this means that those cores are dedicated to your app, and can't be

RE: JDBC Streams

2015-07-05 Thread Ashic Mahtab
and thats where my concern is. TIA Ayan On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab as...@live.com wrote: Hi Ayan,How continuous is your workload? As Akhil points out, with streaming, you'll give up at least one core for receiving, will need at most one more core for processing. Unless you're

RE: .NET on Apache Spark?

2015-07-05 Thread Ashic Mahtab
Unfortunately, afaik that project is long dead. It'd be an interesting project to create an intermediary protocol, perhaps using something that nearly everything these days understand (unfortunately [!] that might be JavaScript). For example, instead of pickling language constructs, it might be

RE: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Ashic Mahtab
Spark comes with quite a few components. At its core is... surprise... Spark Core. This provides the core things required to run Spark jobs. Spark provides a lot of operators out of the box...take a look at

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
?!? From: guha.a...@gmail.com To: as...@live.com CC: user@spark.apache.org It's a problem since 1.3 I think On 26 Jun 2015 04:00, Ashic Mahtab as...@live.com wrote: Hello,Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile(D:\\folder

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Date: Fri, 26 Jun 2015 08:54:31 + On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com wrote: Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Don't blame the spark team, complain to the hadoop team for being slow

Recent spark sc.textFile needs hadoop for folders?!?

2015-06-25 Thread Ashic Mahtab
Hello,Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile(D:\\folder\\).collect() fails from both spark-shell.cmd and when running a scala application referencing the spark-core package from maven.*

RE: Spark SQL odbc on Windows

2015-02-22 Thread Ashic Mahtab
Hi Francisco,While I haven't tried this, have a look at the contents of start-thriftserver.sh - all it's doing is setting up a few variables and calling: /bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and passing some additional parameters. Perhaps doing the

Hive, Spark, Cassandra, Tableau, BI, etc.

2015-02-17 Thread Ashic Mahtab
Hi,I've seen a few articles where they CqlStorageHandler to create hive tables referencing Cassandra data using the thriftserver. Is there a secret to getting this to work? I've basically got Spark built with Hive, and a Cassandra cluster. Is there a way to get the hive server to talk to

Cleanup Questions

2015-02-17 Thread Ashic Mahtab
Two questions regarding worker cleanup: 1) Is the best place to enable worker cleanup setting export SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30 in conf/spark-env.sh for each worker? Or is there a better place? 2) I see this has a default TTL of 7
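A sketch of the spark-env.sh approach (values in seconds; 604800 is the 7-day default TTL mentioned above):
    export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
      -Dspark.worker.cleanup.interval=1800 \
      -Dspark.worker.cleanup.appDataTtl=604800"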

Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
Is there an easy way to check if a spark binary release was built with Hive support? Are any of the prebuilt binaries on the spark website built with hive support? Thanks,Ashic.

RE: Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
are built with -Phive except the 'without-hive' build. On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab as...@live.com wrote: Is there an easy way to check if a spark binary release was built with Hive support? Are any of the prebuilt binaries on the spark website built with hive support

RE: Full per node replication level (architecture question)

2015-01-24 Thread Ashic Mahtab
You could look at using Cassandra for storage. Spark integrates nicely with Cassandra, and a combination of Spark + Cassandra would give you fast access to structured data in Cassandra, while enabling analytic scenarios via Spark. Cassandra would take care of the replication, as it's one of the

RE: Starting a spark streaming app in init.d

2015-01-24 Thread Ashic Mahtab
the main script on sleep for some time (say 2 minutes).ThanksBest Regards On Sat, Jan 24, 2015 at 1:57 AM, Ashic Mahtab as...@live.com wrote: Hello, I'm trying to kick off a spark streaming job to a stand alone master using spark submit inside of init.d. This is what I have: DAEMON=spark

Starting a spark streaming app in init.d

2015-01-23 Thread Ashic Mahtab
Hello, I'm trying to kick off a spark streaming job to a stand alone master using spark submit inside of init.d. This is what I have: DAEMON=spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar start() { $DAEMON -p

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
setup, number of receivers, whats the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
delay!! Can you tell me more about the application?- cluser setup, number of receivers, whats the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
in 4.961 secs (median) to 106msgs in 4,761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, The Cassandra

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
To: as...@live.com CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com Hi Ashic Mahtab, The Cassandra and the Zookeeper are they installed as a part of Yarn architecture or are they installed in a separate layer with Apache Spark . Thanks and Regards, Sudipta On Thu, Jan 22

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
tathagata.das1...@gmail.com wrote: This is not normal. Its a huge scheduling delay!! Can you tell me more about the application?- cluser setup, number of receivers, whats the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
) to 106msgs in 4,761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, The Cassandra and the Zookeeper are they installed

Are these numbers abnormal for spark streaming?

2015-01-21 Thread Ashic Mahtab
Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the

Can multiple streaming apps use the same checkpoint directory?

2015-01-20 Thread Ashic Mahtab
Hi, For client mode spark submits of applications, we can do the following: def createStreamingContext() = { ... val sc = new SparkContext(conf) // Create a StreamingContext with a 1 second batch size val ssc = new StreamingContext(sc, Seconds(1)) } ... val ssc =
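Checkpoint contents are tied to a single application's DStream graph, so the conventional setup is one checkpoint directory per application, recovered via getOrCreate. A sketch (paths, app name and batch interval are placeholders):
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    def createStreamingContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("app-1")
      val ssc = new StreamingContext(new SparkContext(conf), Seconds(1))
      ssc.checkpoint("/checkpoints/app-1")    // unique per application
      // ... define DStreams here ...
      ssc
    }
    val ssc = StreamingContext.getOrCreate("/checkpoints/app-1", createStreamingContext _)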

Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead,

RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
To: as...@live.com CC: user@spark.apache.org I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I

RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
-executors` is not available for standalone clusters. In standalone mode, you must start new workers on your node as it is a 1:1 ratio of workers to executors. On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote: Hi Sean, Thanks for the response. It seems --num-executors
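One standalone-specific setting sometimes used for this: spark.deploy.spreadOut on the master controls whether an application's cores are spread across many workers (the default) or packed onto as few workers as possible. A sketch for spark-env.sh on the master:
    # Pack cores onto as few workers as possible instead of one core per worker.
    export SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"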

How to run an action and get output?

2014-12-19 Thread Ashic Mahtab
Hi, Say we have an operation that writes something to an external resource and gets some output. For example: def doSomething(entry: SomeEntry, session: Session): SomeOutput = { val result = session.SomeOp(entry); SomeOutput(entry.Key, result.SomeProp) } I could use a transformation for

RE: How to run an action and get output?‏

2014-12-19 Thread Ashic Mahtab
Thanks Sean. That's kind of what I figured. Luckily, for my use case writes are idempotent, so map works. From: so...@cloudera.com Date: Fri, 19 Dec 2014 11:06:51 + Subject: Re: How to run an action and get output?‏ To: as...@live.com CC: user@spark.apache.org To really be correct, I
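A sketch of that idempotent-write-inside-a-transformation approach, with a per-partition session so the connection isn't captured in the closure (createSession is a placeholder; closing it cleanly needs extra care because the iterator is lazy):
    val outputs = someRdd.mapPartitions { entries =>
      val session = createSession()                    // one session per partition
      entries.map(e => doSomething(e, session))        // side effect + output; relies on idempotent writes
    }
    outputs.count()   // some action is still needed to actually trigger the writes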

Scala Lazy values and partitions

2014-12-19 Thread Ashic Mahtab
Hi Guys, Are scala lazy values instantiated once per executor, or once per partition? For example, if I have: object Something { lazy val context = create(); def foo(item) = context.doSomething(item) } and I do someRdd.foreach(Something.foo) then will context get instantiated once per

RE: Scala Lazy values and partitions

2014-12-19 Thread Ashic Mahtab
...@gmail.com Date: Fri, 19 Dec 2014 12:52:23 +0100 Subject: Re: Scala Lazy values and partitions To: as...@live.com CC: user@spark.apache.org It will be instantiated once per VM, which translates to once per executor. -kr, Gerard. On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote: Hi
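A sketch of the once-per-JVM (hence once-per-executor) pattern being discussed; createContext is a placeholder for the expensive setup:
    object Something {
      lazy val context = createContext()       // initialised once per executor JVM, on first use
      def foo(item: String): Unit = context.doSomething(item)
    }
    // someRdd.foreach(Something.foo)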

Are lazy values created once per node or once per partition?

2014-12-17 Thread Ashic Mahtab
Hello, Say, I have the following code: let something = Something() someRdd.foreachRdd(something.someMethod) And in something, I have a lazy member variable that gets created in something.someMethod. Would that lazy be created once per node, or once per partition? Thanks, Ashic.

RE: Session for connections?

2014-12-13 Thread Ashic Mahtab
is killed (when the sparkContext is closed). TD On Fri, Dec 12, 2014 at 11:51 PM, Ashic Mahtab as...@live.com wrote: Looks like the way to go. Quick question regarding the connection pool approach - if I have a connection that gets lazily instantiated, will it automatically die if I kill

RE: Session for connections?

2014-12-12 Thread Ashic Mahtab
11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote: That makes sense. I'll try that. Thanks :) From: tathagata.das1...@gmail.com Date: Thu, 11 Dec 2014 04:53:01 -0800 Subject: Re: Session for connections? To: as...@live.com CC: user@spark.apache.org You could create

Session for connections?

2014-12-11 Thread Ashic Mahtab
Hi, I was wondering if there's any way of having long running session type behaviour in spark. For example, let's say we're using Spark Streaming to listen to a stream of events. Upon receiving an event, we process it, and if certain conditions are met, we wish to send a message to rabbitmq.

RE: Session for connections?

2014-12-11 Thread Ashic Mahtab
to shut them down. You could have a usage timeout - shutdown connection after not being used for 10 x batch interval. TD On Thu, Dec 11, 2014 at 4:28 AM, Ashic Mahtab as...@live.com wrote: Hi, I was wondering if there's any way of having long running session type behaviour in spark

RE: Is there a way to force spark to use specific ips?

2014-12-07 Thread Ashic Mahtab
On Dec 6, 2014, at 8:37 AM, Ashic Mahtab as...@live.com wrote:Hi,It appears that spark is always attempting to use the driver's hostname to connect / broadcast. This is usually fine, except when the cluster doesn't have DNS configured. For example, in a vagrant cluster with a private network

Is there a way to force spark to use specific ips?

2014-12-06 Thread Ashic Mahtab
Hi,It appears that spark is always attempting to use the driver's hostname to connect / broadcast. This is usually fine, except when the cluster doesn't have DNS configured. For example, in a vagrant cluster with a private network. The workers and masters, and the host (where the driver runs

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
+/dependency /dependencies build outputDirectorytarget/scala-${scala.binary.version}/classes/outputDirectory You can use the following command:mvn -pl core,streaming package -DskipTests Cheers On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote: Hi, Seems adding

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
PM, Ashic Mahtab as...@live.com wrote: Update: It seems the following combo causes things in spark streaming to go missing: spark-core 1.1.0spark-streaming 1.1.0spark-cassandra-connector 1.1.0 The moment I add the three together, things like StreamingContext and Seconds are unavailable. sbt

Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Hi, Seems adding the cassandra connector and spark streaming causes issues. I've added by build and code file. Running sbt compile gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver is not a member of package org.apache.spark.streaming.receiver. If I

RE: Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Ashic Mahtab
I've done this: 1. foreachPartition 2. Open connection. 3. foreach inside the partition. 4. close the connection. Slightly crufty, but works. Would love to see a better approach. Regards, Ashic. Date: Fri, 5 Dec 2014 12:32:24 -0500 Subject: Spark Streaming Reusing JDBC Connections From:
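A sketch of a slightly less crufty variant: keep a lazily initialised pool in a singleton object so connections live once per executor JVM and are reused across batches instead of being reopened for every partition (ConnectionPool, createPool, borrow/giveBack and writeRecord are placeholders, not a Spark API):
    object ConnectionPool {
      lazy val pool = createPool()                       // built once per executor JVM
    }
    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ConnectionPool.pool.borrow()
        try records.foreach(r => writeRecord(conn, r))
        finally ConnectionPool.pool.giveBack(conn)
      }
    }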

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote: Hi, Seems adding the cassandra connector and spark streaming causes issues. I've added by build and code file. Running sbt compile gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
the following command:mvn -pl core,streaming package -DskipTests Cheers On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote: Hi, Seems adding the cassandra connector and spark streaming causes issues. I've added by build and code file. Running sbt compile gives weird errors like

RE: Kryo exception for CassandraSQLRow

2014-12-01 Thread Ashic Mahtab
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra Connector version 1.1.0 final fixed the guava back compat issue. Maybe taking the guava exclusions might help? Date: Mon, 1 Dec 2014 10:48:25 +0100 Subject: Kryo exception for CassandraSQLRow From: shahab.mok...@gmail.com

Best way to do a lookup in Spark

2014-11-27 Thread Ashic Mahtab
Hi, I'm looking to do an iterative algorithm implementation with data coming in from Cassandra. This might be a use case for GraphX, however the ids are non-integral, and I would like to avoid a mapping (for now). I'm doing a simple hubs and authorities HITS implementation, and the current

Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to create a uber jar to submit to the stand alone master. I'm using the hadoop 1 prebuilt binaries for Spark. As soon as I try to do sc.CassandraTable(...) I get an error that's likely to be a Guava versioning
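One hedged option (assuming a later sbt-assembly, 0.14+, which postdates parts of this thread) is to shade Guava inside the uber jar so the connector's Guava cannot clash with the version Spark ships:
    // build.sbt sketch
    assemblyShadeRules in assembly := Seq(
      ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
    )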

RE: Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
://github.com/datastax/spark-cassandra-connector/issues/292 best,/Shahab On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote: I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to create a uber jar to submit to the stand alone master. I'm using

RE: Spark or MR, Scala or Java?

2014-11-22 Thread Ashic Mahtab
Spark can do Map Reduce and more, and faster. One area where using MR would make sense is if you're using something (maybe like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible now...just pulled that name out of thin air!). You *can* use Spark from Java, but you'd have a

RE: tableau spark sql cassandra

2014-11-20 Thread Ashic Mahtab
Hi Jerome, I've been trying to get this working as well... Where are you specifying cassandra parameters (i.e. seed nodes, consistency levels, etc.)? -Ashic. Date: Thu, 20 Nov 2014 10:34:58 -0700 From: jer...@gmail.com To: u...@spark.incubator.apache.org Subject: Re: tableau spark sql

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
Hi Ben,I haven't tried it with Python, but the instructions are the same as for Scala compiled (jar) apps. What it's saying is that it's not possible to offload the entire work to the master (ala hadoop) in a fire and forget (or rather submit-and-forget) manner when running on stand alone.

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
into the inability to share the SparkContext feature and it took a lot of finagling to make things work (but it never felt production ready). Ognen On Sat, Nov 15, 2014 at 03:36:43PM +, Ashic Mahtab wrote: Hi Ben,I haven't tried it with Python, but the instructions are the same as for Scala

RE: Spark-submit and Windows / Linux mixed network

2014-11-12 Thread Ashic Mahtab
jar not found :( Seems if I create a directory sim link so that the share path in the same on the unix mount point as in windows, and submit from the drive where the mount point is, then it works. Granted, that's quite an ugly hack. Reverting to serving jar off http (i.e. using a relative

Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ashic Mahtab
Hi, I'm trying to submit a spark application from a network share to the spark master. Network shares are configured so that the master and all nodes have access to the target jar at (say): \\shares\publish\Spark\app1\someJar.jar And this is mounted on each linux box (i.e. master and workers) at:

Solidifying Understanding of Standalone Mode

2014-11-10 Thread Ashic Mahtab
Hello, I'm hoping to understand exactly what happens when a spark compiled app is submitted to a spark stand-alone cluster master. Say, our master is A, and workers are W1 and W2. Client machine C is submitting an app to the master using spark-submit. Here's what I think happens? * C submits

Redploying a spark streaming application

2014-11-06 Thread Ashic Mahtab
Hello,I'm trying to find the best way of redeploying a spark streaming application. Ideally, I was thinking of a scenario where a build server packages up a jar and a deployment step submits it to a Spark Master. On the next successful build, the next version would get deployed taking down the

Standalone Specify mem / cores defaults

2014-11-05 Thread Ashic Mahtab
Hi, The docs specify that we can control the amount of ram / cores available via: -c CORES, --cores CORESTotal CPU cores to allow Spark applications to use on the machine (default: all available); only on worker-m MEM, --memory MEMTotal amount of memory to allow Spark applications to use on the
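A sketch of the usual place for those defaults, conf/spark-env.sh on each worker (the values are placeholders):
    export SPARK_WORKER_CORES=2
    export SPARK_WORKER_MEMORY=6g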

RE: Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
version However, workers should be able to re-register since 1.2, since this patch https://github.com/apache/spark/pull/2828 was merged Best, -- Nan Zhu On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote

RE: how idf is calculated

2014-10-30 Thread Ashic Mahtab
Hi Andrejs,The calculations are a bit different to what I've come across in Mining Massive Datasets (2nd Ed. Ullman et. al., Cambridge Press) available here:http://www.mmds.org/ Their calculation of IDF is as follows: IDFi = log2(N / ni) where N is the number of documents and ni is the number
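For comparison, Spark MLlib's documented IDF uses a smoothed logarithm: IDF(t) = log((m + 1) / (df(t) + 1)), where m is the total number of documents and df(t) the number of documents containing term t; the +1 terms avoid division by zero for unseen terms, and the choice of log base (natural vs. base 2) only rescales every weight by a constant factor.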

RE: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Ashic Mahtab
I'm quite interested in this as well. I remember something about a streaming context needing one core. If that's the case, then won't 10 apps require 10 cores? Seems like a waste unless each topic is quite resource hungry? Would love to hear from the experts :) Date: Mon, 27 Oct 2014 06:35:29
