Standalone Apps and ClassNotFound

2014-10-16 Thread Ashic Mahtab
I'm relatively new to Spark and have got a couple of questions: * I've got an IntelliJ SBT project that's using Spark Streaming with a custom RabbitMQ receiver in the same project. When I run it against local[2], all's well. When I put in spark://masterip:7077, I get a ClassNotFoundException

Getting Spark SQL talking to Sql Server

2014-10-21 Thread Ashic Mahtab
Hi, Is there a simple way to run spark sql queries against Sql Server databases? Or are we limited to running sql and doing sc.Parallelize()? Being able to query small amounts of lookup info directly from spark can save a bunch of annoying etl, and I'd expect Spark Sql to have some way of doing

RE: Getting Spark SQL talking to Sql Server

2014-10-21 Thread Ashic Mahtab
Instead of using Spark SQL, you can use JdbcRDD to extract data from SQL server. Currently Spark SQL can't run queries against SQL server. The foreign data source API planned in Spark 1.2 can make this possible. On 10/21/14 6:26 PM, Ashic Mahtab wrote
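
For reference, a minimal JdbcRDD sketch against SQL Server (the JDBC URL, table and bounds below are made up, and the Microsoft JDBC driver is assumed to be on the classpath; JdbcRDD requires the two '?' markers, which it binds to each partition's bounds):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.JdbcRDD

    val sc = new SparkContext(new SparkConf().setAppName("jdbc-lookup"))
    // Hypothetical connection string; adjust for your server and credentials.
    val url = "jdbc:sqlserver://dbhost;databaseName=lookups;user=u;password=p"

    val lookup = new JdbcRDD(
      sc,
      () => DriverManager.getConnection(url),
      "SELECT id, name FROM lookup_table WHERE id >= ? AND id <= ?",
      1L, 10000L, 4, // lower bound, upper bound, number of partitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
    )
    lookup.collect().foreach(println)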

RE: Python vs Scala performance

2014-10-22 Thread Ashic Mahtab
I'm no expert, but looked into how the Python bits work a while back (was trying to assess what it would take to add F# support). It seems Python launches a JVM alongside itself, and talks to Scala Spark in that JVM. The Python server bit translates the Python calls to those in the JVM. The python

Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
I'm looking to use spark for some ETL, which will mostly consist of update statements (a column is a set, that'll be appended to, so a simple insert is likely not going to work). As such, it seems like issuing CQL queries to import the data is the best option. Using the Spark Cassandra

RE: Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
and write it back to Cassandra. Kr, Gerard On Oct 23, 2014 1:21 PM, Ashic Mahtab as...@live.com wrote: I'm looking to use spark for some ETL, which will mostly consist of update statements (a column is a set, that'll be appended to, so a simple insert is likely not going to work

RE: Spark Cassandra Connector proper usage

2014-10-23 Thread Ashic Mahtab
. On Thu, Oct 23, 2014 at 2:48 PM, Ashic Mahtab as...@live.com wrote: Hi Gerard, Thanks for the response. Here's the scenario: The target cassandra schema looks like this: create table foo ( id text primary key, bar int, things set<text> ) The source in question is a Sql Server source
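
Since a plain insert can't append to a set column, one option along the lines discussed in this thread is to issue the CQL append directly through the connector. A rough sketch, assuming the DataStax spark-cassandra-connector and a hypothetical RDD of (id, thing) pairs:

    import com.datastax.spark.connector.cql.CassandraConnector
    import org.apache.spark.rdd.RDD

    // Appends each 'thing' to the set<text> column of the foo table above.
    def appendThings(rows: RDD[(String, String)], connector: CassandraConnector): Unit = {
      rows.foreachPartition { part =>
        connector.withSessionDo { session =>
          val stmt = session.prepare("UPDATE foo SET things = things + ? WHERE id = ?")
          part.foreach { case (id, thing) =>
            session.execute(stmt.bind(java.util.Collections.singleton(thing), id))
          }
        }
      }
    }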

RE: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Ashic Mahtab
I'm quite interested in this as well. I remember something about a streaming context needing one core. If that's the case, then won't 10 apps require 10 cores? Seems like a waste unless each topic is quite resource hungry? Would love to hear from the experts :) Date: Mon, 27 Oct 2014 06:35:29

RE: how idf is calculated

2014-10-30 Thread Ashic Mahtab
Hi Andrejs, The calculations are a bit different to what I've come across in Mining Massive Datasets (2nd Ed., Ullman et al., Cambridge Press), available here: http://www.mmds.org/ Their calculation of IDF is as follows: IDFi = log2(N / ni), where N is the number of documents and ni is the number
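
As a quick worked example of the formula quoted above (the numbers are made up): with N = 1000 documents and a term appearing in ni = 10 of them, IDFi = log2(1000 / 10) = log2(100) ≈ 6.64. In code:

    // log2(N / ni), as in MMDS. Spark's mllib IDF uses a smoothed natural-log variant instead.
    def idf(numDocs: Long, docsWithTerm: Long): Double =
      math.log(numDocs.toDouble / docsWithTerm) / math.log(2.0)

    idf(1000L, 10L) // ≈ 6.64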

RE: Workers not registering after master restart

2014-11-04 Thread Ashic Mahtab
version However, workers should be able to re-register since 1.2, since this patch https://github.com/apache/spark/pull/2828 was merged Best, -- Nan Zhu On Tuesday, November 4, 2014 at 6:00 PM, Ashic Mahtab wrote

Standalone Specify mem / cores defaults

2014-11-05 Thread Ashic Mahtab
Hi, The docs specify that we can control the amount of ram / cores available via: -c CORES, --cores CORES (total CPU cores to allow Spark applications to use on the machine; default: all available; only on worker) and -m MEM, --memory MEM (total amount of memory to allow Spark applications to use on the

Redeploying a spark streaming application

2014-11-06 Thread Ashic Mahtab
Hello,I'm trying to find the best way of redeploying a spark streaming application. Ideally, I was thinking of a scenario where a build server packages up a jar and a deployment step submits it to a Spark Master. On the next successful build, the next version would get deployed taking down the

Solidifying Understanding of Standalone Mode

2014-11-10 Thread Ashic Mahtab
Hello, I'm hoping to understand exactly what happens when a spark compiled app is submitted to a spark stand-alone cluster master. Say, our master is A, and workers are W1 and W2. Client machine C is submitting an app to the master using spark-submit. Here's what I think happens? * C submits

Spark-submit and Windows / Linux mixed network

2014-11-11 Thread Ashic Mahtab
Hi, I'm trying to submit a spark application from a network share to the spark master. Network shares are configured so that the master and all nodes have access to the target jar at (say): \\shares\publish\Spark\app1\someJar.jar And this is mounted on each linux box (i.e. master and workers) at:

RE: Spark-submit and Windows / Linux mixed network

2014-11-12 Thread Ashic Mahtab
jar not found :( Seems if I create a directory symlink so that the share path is the same on the unix mount point as in windows, and submit from the drive where the mount point is, then it works. Granted, that's quite an ugly hack. Reverting to serving the jar off http (i.e. using a relative

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
Hi Ben,I haven't tried it with Python, but the instructions are the same as for Scala compiled (jar) apps. What it's saying is that it's not possible to offload the entire work to the master (ala hadoop) in a fire and forget (or rather submit-and-forget) manner when running on stand alone.

RE: Submitting Python Applications from Remote to Master

2014-11-15 Thread Ashic Mahtab
into the inability to share the SparkContext feature and it took a lot of finagling to make things work (but it never felt production ready). Ognen On Sat, Nov 15, 2014 at 03:36:43PM +, Ashic Mahtab wrote: Hi Ben,I haven't tried it with Python, but the instructions are the same as for Scala

RE: tableau spark sql cassandra

2014-11-20 Thread Ashic Mahtab
Hi Jerome, I've been trying to get this working as well... Where are you specifying cassandra parameters (i.e. seed nodes, consistency levels, etc.)? -Ashic. Date: Thu, 20 Nov 2014 10:34:58 -0700 From: jer...@gmail.com To: u...@spark.incubator.apache.org Subject: Re: tableau spark sql

RE: Spark or MR, Scala or Java?

2014-11-22 Thread Ashic Mahtab
Spark can do Map Reduce and more, and faster. One area where using MR would make sense is if you're using something (maybe like Mahout) that doesn't understand Spark yet (Mahout may be Spark compatible now...just pulled that name out of thin air!). You *can* use Spark from Java, but you'd have a

Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to create an uber jar to submit to the standalone master. I'm using the hadoop 1 prebuilt binaries for Spark. As soon as I try to do sc.cassandraTable(...) I get an error that's likely to be a Guava versioning

RE: Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
://github.com/datastax/spark-cassandra-connector/issues/292 best,/Shahab On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote: I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to create a uber jar to submit to the stand alone master. I'm using

Best way to do a lookup in Spark

2014-11-27 Thread Ashic Mahtab
Hi, I'm looking to do an iterative algorithm implementation with data coming in from Cassandra. This might be a use case for GraphX, however the ids are non-integral, and I would like to avoid a mapping (for now). I'm doing a simple hubs and authorities HITS implementation, and the current

RE: Kryo exception for CassandraSQLRow

2014-12-01 Thread Ashic Mahtab
Don't know if this'll solve it, but if you're on Spark 1.1, the Cassandra Connector version 1.1.0 final fixed the guava back compat issue. Maybe taking the guava exclusions might help? Date: Mon, 1 Dec 2014 10:48:25 +0100 Subject: Kryo exception for CassandraSQLRow From: shahab.mok...@gmail.com

Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
Hi, Seems adding the cassandra connector and spark streaming causes issues. I've added my build and code file. Running sbt compile gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver is not a member of package org.apache.spark.streaming.receiver. If I

RE: Spark Streaming Reusing JDBC Connections

2014-12-05 Thread Ashic Mahtab
I've done this: 1. foreachPartition 2. Open connection. 3. foreach inside the partition. 4. close the connection. Slightly crufty, but works. Would love to see a better approach. Regards, Ashic. Date: Fri, 5 Dec 2014 12:32:24 -0500 Subject: Spark Streaming Reusing JDBC Connections From:
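
A minimal sketch of that pattern (the JDBC URL, table and insert statement are illustrative):

    import java.sql.DriverManager
    import org.apache.spark.streaming.dstream.DStream

    def saveToDb(stream: DStream[String], jdbcUrl: String): Unit = {
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val conn = DriverManager.getConnection(jdbcUrl) // one connection per partition
          try {
            val ps = conn.prepareStatement("INSERT INTO events (payload) VALUES (?)")
            records.foreach { r =>
              ps.setString(1, r)
              ps.executeUpdate()
            }
            ps.close()
          } finally {
            conn.close() // closed once the partition is done
          }
        }
      }
    }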

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote: Hi, Seems adding the cassandra connector and spark streaming causes issues. I've added by build and code file. Running sbt compile gives weird errors like Seconds is not part of org.apache.spark.streaming and object Receiver

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-05 Thread Ashic Mahtab
the following command:mvn -pl core,streaming package -DskipTests Cheers On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote: Hi, Seems adding the cassandra connector and spark streaming causes issues. I've added by build and code file. Running sbt compile gives weird errors like

Is there a way to force spark to use specific ips?

2014-12-06 Thread Ashic Mahtab
Hi,It appears that spark is always attempting to use the driver's hostname to connect / broadcast. This is usually fine, except when the cluster doesn't have DNS configured. For example, in a vagrant cluster with a private network. The workers and masters, and the host (where the driver runs

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
+ </dependency> </dependencies> <build> <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory> You can use the following command: mvn -pl core,streaming package -DskipTests Cheers On Fri, Dec 5, 2014 at 9:35 AM, Ashic Mahtab as...@live.com wrote: Hi, Seems adding

RE: Adding Spark Cassandra dependency breaks Spark Streaming?

2014-12-06 Thread Ashic Mahtab
PM, Ashic Mahtab as...@live.com wrote: Update: It seems the following combo causes things in spark streaming to go missing: spark-core 1.1.0, spark-streaming 1.1.0, spark-cassandra-connector 1.1.0. The moment I add the three together, things like StreamingContext and Seconds are unavailable. sbt

RE: Is there a way to force spark to use specific ips?

2014-12-07 Thread Ashic Mahtab
On Dec 6, 2014, at 8:37 AM, Ashic Mahtab as...@live.com wrote:Hi,It appears that spark is always attempting to use the driver's hostname to connect / broadcast. This is usually fine, except when the cluster doesn't have DNS configured. For example, in a vagrant cluster with a private network

Session for connections?

2014-12-11 Thread Ashic Mahtab
Hi, I was wondering if there's any way of having long running session type behaviour in spark. For example, let's say we're using Spark Streaming to listen to a stream of events. Upon receiving an event, we process it, and if certain conditions are met, we wish to send a message to rabbitmq.

RE: Session for connections?

2014-12-11 Thread Ashic Mahtab
to shut them down. You could have a usage timeout - shutdown connection after not being used for 10 x batch interval. TD On Thu, Dec 11, 2014 at 4:28 AM, Ashic Mahtab as...@live.com wrote: Hi, I was wondering if there's any way of having long running session type behaviour in spark
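
A sketch of the lazily instantiated, per-executor connection discussed in this thread (the RabbitMQ host and routing details are hypothetical; the point is the lazy val inside a singleton object):

    import com.rabbitmq.client.{Connection, ConnectionFactory}

    object RabbitHolder {
      // Created at most once per executor JVM, on first use.
      lazy val connection: Connection = {
        val factory = new ConnectionFactory()
        factory.setHost("rabbit-host") // hypothetical host
        factory.newConnection()
      }
    }

    // Inside the streaming job, something like:
    // dstream.foreachRDD(_.foreachPartition { events =>
    //   val channel = RabbitHolder.connection.createChannel()
    //   events.foreach(e => channel.basicPublish("", "alerts", null, e.getBytes("UTF-8")))
    //   channel.close()
    // })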

RE: Session for connections?

2014-12-12 Thread Ashic Mahtab
11, 2014 at 4:55 AM, Ashic Mahtab as...@live.com wrote: That makes sense. I'll try that. Thanks :) From: tathagata.das1...@gmail.com Date: Thu, 11 Dec 2014 04:53:01 -0800 Subject: Re: Session for connections? To: as...@live.com CC: user@spark.apache.org You could create

RE: Session for connections?

2014-12-13 Thread Ashic Mahtab
is killed (when the sparkContext is closed). TD On Fri, Dec 12, 2014 at 11:51 PM, Ashic Mahtab as...@live.com wrote: Looks like the way to go. Quick question regarding the connection pool approach - if I have a connection that gets lazily instantiated, will it automatically die if I kill

Are lazy values created once per node or once per partition?

2014-12-17 Thread Ashic Mahtab
Hello, Say I have the following code: val something = Something(); someRdd.foreachRDD(something.someMethod) And in something, I have a lazy member variable that gets created in something.someMethod. Would that lazy be created once per node, or once per partition? Thanks, Ashic.

How to run an action and get output?

2014-12-19 Thread Ashic Mahtab
Hi, Say we have an operation that writes something to an external resource and gets some output. For example: def doSomething(entry: SomeEntry, session: Session): SomeOutput = { val result = session.SomeOp(entry); SomeOutput(entry.Key, result.SomeProp) } I could use a transformation for

RE: How to run an action and get output?‏

2014-12-19 Thread Ashic Mahtab
Thanks Sean. That's kind of what I figured. Luckily, for my use case writes are idempotent, so map works. From: so...@cloudera.com Date: Fri, 19 Dec 2014 11:06:51 + Subject: Re: How to run an action and get output?‏ To: as...@live.com CC: user@spark.apache.org To really be correct, I
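
Spelled out as a sketch (the types and createSession below are stand-ins for those in the question, and an existing SparkContext sc is assumed):

    // Hypothetical stubs matching the shape of the original question.
    case class SomeEntry(key: String)
    case class SomeOutput(key: String, prop: Int)
    trait Session { def someOp(e: SomeEntry): Int }
    def createSession(): Session = new Session { def someOp(e: SomeEntry) = e.key.length }

    def doSomething(entry: SomeEntry, session: Session): SomeOutput =
      SomeOutput(entry.key, session.someOp(entry)) // performs the write, returns the output

    val entries = sc.parallelize(Seq(SomeEntry("a"), SomeEntry("bb")))
    val outputs = entries.mapPartitions { part =>
      val session = createSession()          // one session per partition, built on the executor
      part.map(e => doSomething(e, session))
    }
    outputs.collect()                        // transformations are lazy; an action forces the writes

Because transformations can be re-run on failure, this only behaves well when the writes are idempotent, as noted above.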

Scala Lazy values and partitions

2014-12-19 Thread Ashic Mahtab
Hi Guys, Are scala lazy values instantiated once per executor, or once per partition? For example, if I have: object Something { lazy val context = create(); def foo(item) = context.doSomething(item) } and I do someRdd.foreach(Something.foo) then will context get instantiated once per

RE: Scala Lazy values and partitions

2014-12-19 Thread Ashic Mahtab
...@gmail.com Date: Fri, 19 Dec 2014 12:52:23 +0100 Subject: Re: Scala Lazy values and partitions To: as...@live.com CC: user@spark.apache.org It will be instantiated once per VM, which translates to once per executor. -kr, Gerard. On Fri, Dec 19, 2014 at 12:21 PM, Ashic Mahtab as...@live.com wrote: Hi
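
A small way to observe that behaviour (assuming an existing SparkContext sc; with, say, two executors you would expect two distinct context values across the eight partitions):

    object Something {
      // Initialised at most once per executor JVM, the first time foo runs there.
      lazy val context: String = java.util.UUID.randomUUID().toString
      def foo(item: Int): String = s"item $item handled by context $context"
    }

    sc.parallelize(1 to 8, 8).map(Something.foo).collect().foreach(println)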

Using more cores on machines

2014-12-22 Thread Ashic Mahtab
Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I can do this via spark submit by: spark-submit --total-executor-cores 4 However, this assigns one core per machine. I would like to use 2 cores on 2 machines instead,

RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
To: as...@live.com CC: user@spark.apache.org I think you want: --num-executors 2 --executor-cores 2 On Mon, Dec 22, 2014 at 10:39 AM, Ashic Mahtab as...@live.com wrote: Hi, Say we have 4 nodes with 2 cores each in stand alone mode. I'd like to dedicate 4 cores to a streaming application. I

RE: Using more cores on machines

2014-12-22 Thread Ashic Mahtab
-executors` is not available for standalone clusters. In standalone mode, you must start new workers on your node as it is a 1:1 ratio of workers to executors. On 22 December 2014 at 12:25, Ashic Mahtab as...@live.com wrote: Hi Sean, Thanks for the response. It seems --num-executors

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
setup, number of receivers, whats the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some docs that would tell me if these are expected or not. From: as...@live.com To: user@spark.apache.org Subject: Are these numbers abnormal for spark streaming? Date: Wed, 21 Jan 2015

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
delay!! Can you tell me more about the application?- cluser setup, number of receivers, whats the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump? Would really appreciate input from others using Streaming. Or at least some

Starting a spark streaming app in init.d

2015-01-23 Thread Ashic Mahtab
Hello, I'm trying to kick off a spark streaming job to a stand alone master using spark submit inside of init.d. This is what I have: DAEMON=spark-submit --class Streamer --executor-memory 500M --total-executor-cores 4 /path/to/assembly.jar start() { $DAEMON -p

RE: Full per node replication level (architecture question)

2015-01-24 Thread Ashic Mahtab
You could look at using Cassandra for storage. Spark integrates nicely with Cassandra, and a combination of Spark + Cassandra would give you fast access to structured data in Cassandra, while enabling analytic scenarios via Spark. Cassandra would take care of the replication, as it's one of the

Can multiple streaming apps use the same checkpoint directory?

2015-01-20 Thread Ashic Mahtab
Hi, For client mode spark submits of applications, we can do the following: def createStreamingContext() = { ... val sc = new SparkContext(conf) // Create a StreamingContext with a 1 second batch size val ssc = new StreamingContext(sc, Seconds(1)) } ... val ssc =
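
For context, the usual shape of that pattern is StreamingContext.getOrCreate; a sketch with an illustrative checkpoint path and batch interval (all DStream setup has to happen inside the creating function so it can be recovered from the checkpoint):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/tmp/streaming-checkpoint" // illustrative path

    def createStreamingContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpointed-app")
      val ssc = new StreamingContext(new SparkContext(conf), Seconds(1))
      ssc.checkpoint(checkpointDir)
      // ... define input DStreams and output operations here ...
      ssc
    }

    // Recovers from the checkpoint if one exists, otherwise builds a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)
    ssc.start()
    ssc.awaitTermination()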

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
in 4.961 secs (median) to 106msgs in 4,761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, The Cassandra

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
To: as...@live.com CC: gerard.m...@gmail.com; user@spark.apache.org; tathagata.das1...@gmail.com Hi Ashic Mahtab, The Cassandra and the Zookeeper are they installed as a part of Yarn architecture or are they installed in a separate layer with Apache Spark . Thanks and Regards, Sudipta On Thu, Jan 22

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
tathagata.das1...@gmail.com wrote: This is not normal. Its a huge scheduling delay!! Can you tell me more about the application?- cluser setup, number of receivers, whats the computation, etc. On Thu, Jan 22, 2015 at 3:11 AM, Ashic Mahtab as...@live.com wrote: Hate to do this...but...erm...bump

RE: Are these numbers abnormal for spark streaming?

2015-01-22 Thread Ashic Mahtab
) to 106msgs in 4,761 seconds. I think there's evidence that setup costs are quite high in this case and increasing the batch interval is helping. On Thu, Jan 22, 2015 at 4:12 PM, Sudipta Banerjee asudipta.baner...@gmail.com wrote: Hi Ashic Mahtab, The Cassandra and the Zookeeper are they installed

Are these numbers abnormal for spark streaming?

2015-01-21 Thread Ashic Mahtab
Hi Guys, I've got Spark Streaming set up for a low data rate system (using spark's features for analysis, rather than high throughput). Messages are coming in throughout the day, at around 1-20 per second (finger in the air estimate...not analysed yet). In the spark streaming UI for the

Hive, Spark, Cassandra, Tableau, BI, etc.

2015-02-17 Thread Ashic Mahtab
Hi, I've seen a few articles where they use CqlStorageHandler to create hive tables referencing Cassandra data using the thriftserver. Is there a secret to getting this to work? I've basically got Spark built with Hive, and a Cassandra cluster. Is there a way to get the hive server to talk to

Cleanup Questions

2015-02-17 Thread Ashic Mahtab
Two questions regarding worker cleanup: 1) Is the best place to enable worker cleanup setting export SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=30 in conf/spark-env.sh for each worker? Or is there a better place? 2) I see this has a default TTL of 7

RE: Starting a spark streaming app in init.d

2015-01-24 Thread Ashic Mahtab
the main script on sleep for some time (say 2 minutes).ThanksBest Regards On Sat, Jan 24, 2015 at 1:57 AM, Ashic Mahtab as...@live.com wrote: Hello, I'm trying to kick off a spark streaming job to a stand alone master using spark submit inside of init.d. This is what I have: DAEMON=spark

Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
Is there an easy way to check if a spark binary release was built with Hive support? Are any of the prebuilt binaries on the spark website built with hive support? Thanks,Ashic.

RE: Check if spark was built with hive

2015-02-09 Thread Ashic Mahtab
are built with -Phive except the 'without-hive' build. On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab as...@live.com wrote: Is there an easy way to check if a spark binary release was built with Hive support? Are any of the prebuilt binaries on the spark website built with hive support
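
One quick, if crude, check from a Scala shell or app is simply to look for the Hive classes on the classpath of the binary in question:

    // True if the distribution was built with -Phive (the Hive classes are bundled).
    val hasHive = scala.util.Try(Class.forName("org.apache.spark.sql.hive.HiveContext")).isSuccess
    println(s"Built with Hive support: $hasHive")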

RE: Spark SQL odbc on Windows

2015-02-22 Thread Ashic Mahtab
Hi Francisco,While I haven't tried this, have a look at the contents of start-thriftserver.sh - all it's doing is setting up a few variables and calling: /bin/spark-submit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and passing some additional parameters. Perhaps doing the

Recent spark sc.textFile needs hadoop for folders?!?

2015-06-25 Thread Ashic Mahtab
Hello,Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile(D:\\folder\\).collect() fails from both spark-shell.cmd and when running a scala application referencing the spark-core package from maven.*

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
?!? From: guha.a...@gmail.com To: as...@live.com CC: user@spark.apache.org It's a problem since 1.3 I think On 26 Jun 2015 04:00, Ashic Mahtab as...@live.com wrote: Hello,Just trying out spark 1.4 (we're using 1.1 at present). On Windows, I've noticed the following: * On 1.4, sc.textFile(D:\\folder

RE: Recent spark sc.textFile needs hadoop for folders?!?

2015-06-26 Thread Ashic Mahtab
Date: Fri, 26 Jun 2015 08:54:31 + On 26 Jun 2015, at 09:29, Ashic Mahtab as...@live.com wrote: Thanks for the replies, guys. Is this a permanent change as of 1.3, or will it go away at some point? Don't blame the spark team, complain to the hadoop team for being slow

RE: JDBC Streams

2015-07-05 Thread Ashic Mahtab
Hi Ayan,How continuous is your workload? As Akhil points out, with streaming, you'll give up at least one core for receiving, will need at most one more core for processing. Unless you're running on something like Mesos, this means that those cores are dedicated to your app, and can't be

RE: JDBC Streams

2015-07-05 Thread Ashic Mahtab
and thats where my concern is. TIA Ayan On Mon, Jul 6, 2015 at 12:51 AM, Ashic Mahtab as...@live.com wrote: Hi Ayan,How continuous is your workload? As Akhil points out, with streaming, you'll give up at least one core for receiving, will need at most one more core for processing. Unless you're

RE: .NET on Apache Spark?

2015-07-05 Thread Ashic Mahtab
Unfortunately, afaik that project is long dead. It'd be an interesting project to create an intermediary protocol, perhaps using something that nearly everything these days understand (unfortunately [!] that might be JavaScript). For example, instead of pickling language constructs, it might be

RE: What does Spark is not just MapReduce mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Ashic Mahtab
Spark comes with quite a few components. At its core is...surprise...spark core. This provides the core things required to run spark jobs. Spark provides a lot of operators out of the box...take a look at

RE: Spark Broadcasting large dataset

2015-07-10 Thread Ashic Mahtab
When you say tasks, do you mean different applications, or different tasks in the same application? If it's the same program, they should be able to share the broadcasted value. But given you're asking the question, I imagine they're separate. And in that case, afaik, the answer is no. You

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ashic Mahtab
Hi Ewan, Transforms are definitions of what needs to be done - they don't execute until an action is triggered. For what you want, I think you might need to have an action that writes out rdds to some sort of buffered writer. -Ashic. From: ewan.le...@realitymine.com To: user@spark.apache.org
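
To illustrate the laziness point (a trivial sketch, assuming an existing SparkContext sc):

    val words = sc.parallelize(Seq("a", "b", "c"))
    val upper = words.map { w =>
      println(s"mapping $w") // nothing runs yet: map only records the work to be done
      w.toUpperCase
    }
    upper.count()            // the action triggers execution, and the printlns fire now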

Working offline with spark-core and sbt

2015-12-30 Thread Ashic Mahtab
Hello, I'm trying to work offline with spark-core. I've got an empty project with the following:

    name := "sbtSand"
    version := "1.0"
    scalaVersion := "2.11.7"
    libraryDependencies ++= Seq(
      "joda-time" % "joda-time" % "2.9.1",
      "org.apache.spark" %% "spark-core" % "1.5.2"
    )

I can "sbt

RE: Working offline with spark-core and sbt

2015-12-30 Thread Ashic Mahtab
To answer my own question, it appears certain things (like parents, etc.) caused the issue. I was using sbt 0.13.8. Using 0.13.9 works fine. From: as...@live.com To: user@spark.apache.org Subject: Working offline with spark-core and sbt Date: Thu, 31 Dec 2015 02:07:26 + Hello,I'm trying

Specify log4j properties file

2016-03-09 Thread Ashic Mahtab
Hello,Is it possible to provide a log4j properties file when submitting jobs to a cluster? I know that by default spark looks for a log4j.properties file in the conf directory. I'm looking for a way to specify a different log4j.properties file (external to the application) without pointing to a

RE: Specify log4j properties file

2016-03-09 Thread Ashic Mahtab
Found it. You can pass in the jvm parameter log4j.configuration. The following works: -Dlog4j.configuration=file:path/to/log4j.properties It doesn't work without the file: prefix though. Tested in 1.6.0. Cheers,Ashic. From: as...@live.com To: user@spark.apache.org Subject: Specify log4j

RE: log4j pains

2016-03-10 Thread Ashic Mahtab
src/main/resources/log4j.properties Subject: Re: log4j pains From: st...@memeticlabs.org Date: Thu, 10 Mar 2016 11:08:46 -0600 CC: user@spark.apache.org To: as...@live.com Where in the jar is the log4j.properties file? On Mar 10, 2016, at 9:40 AM, Ashic Mahtab <as...@live.com> wrote:1. F

log4j pains

2016-03-10 Thread Ashic Mahtab
Hello,I'm trying to use a custom log4j appender, with things specified in a log4j.properties file. Very little seems to work in this regard. Here's what I've tried: 1. Fat jar with logging dependencies included. log4j.properties in fat jar. Spark doesn't pick up the properties file, so uses its

RE: Spark on Mobile platforms

2016-04-07 Thread Ashic Mahtab
Spark may not be the right tool for this. Working on just the mobile device, you won't be scaling out stuff, and as such most of the benefits of Spark would be nullified. Moreover, it'd likely run slower than things that are meant to work in a single process. Spark is also quite large, which is

RE: ML Random Forest Classifier

2016-04-13 Thread Ashic Mahtab
@spark.apache.org Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab <as...@live.

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
to write code in the org.apache.spark.ml package. I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest classifier. If

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
ight work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest classifier. If I try to save the pipeline, it complains that the classifier is not Writable, and indeed the classifier itself doesn't have a

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
will need to write code in the org.apache.spark.ml package. I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest class

ML Random Forest Classifier

2016-04-11 Thread Ashic Mahtab
Hello,I'm trying to save a pipeline with a random forest classifier. If I try to save the pipeline, it complains that the classifier is not Writable, and indeed the classifier itself doesn't have a write function. There's a pull request that's been merged that enables this for Spark 2.0 (any

RE: ML Random Forest Classifier

2016-04-11 Thread Ashic Mahtab
rite code in the org.apache.spark.ml package. I've not actually tried doing this myself but it looks as if it might work. Regards, James On 11 April 2016 at 10:29, Ashic Mahtab <as...@live.com> wrote: Hello,I'm trying to save a pipeline with a random forest classifier. If I try to save the pipelin

Additional classpaths / java options

2016-03-22 Thread Ashic Mahtab
Hello, Is it possible to specify additional class paths / java options "in addition to" those specified in spark-defaults.conf? I see that if I specify spark.executor.extraJavaOptions or spark.executor.extraClassPath in defaults, and then specify --conf

RE: Logging in executors

2016-04-18 Thread Ashic Mahtab
I spent ages on this recently, and here's what I found: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///local/file/on.executor.properties" works. Alternatively, you can also do: --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=filename.properties"

RE: Spark + HDFS

2016-04-19 Thread Ashic Mahtab
Spark will execute as a client for hdfs. In other words, it'll contact the namenode of the hdfs cluster, which will return the block info, and then the data will be fetched from the data nodes. Date: Tue, 19 Apr 2016 14:00:31 +0530 Subject: Spark + HDFS From: chaturvedich...@gmail.com To:

Simulate serialization when running local

2016-08-10 Thread Ashic Mahtab
Hi,Is there a way to simulate "networked" spark when running local (i.e. master=local[4])? Ideally, some setting that'll ensure any "Task not serializable" errors are caught during local testing? I seem to vaguely remember something, but am having trouble pinpointing it. Cheers,Ashic.

RE: Spark join and large temp files

2016-08-11 Thread Ashic Mahtab
) and .count() to force a shuffle, it'll push the records that will be joined to the same executors. So;a = spark.read.parquet(‘path_to_table_a’).repartition(‘id’).cache()a.count() b = spark.read.parquet(‘path_to_table_b').repartition(‘id’).cache()b.count() And then join.. On Aug 8, 2016, at 8:17 PM,

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
in no case be liable for any monetary damages arising from such loss, damage or destruction. On 9 August 2016 at 15:46, Ashic Mahtab <as...@live.com> wrote: Hi Sam,Yup. It seems it stalls when broadcasting. CPU goes to 100%, but there's no progress. The spark UI doesn't even show up

RE: Simulate serialization when running local

2016-08-15 Thread Ashic Mahtab
k/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2478 Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Wed, Aug 10, 2016 at 10:24 AM, Ashic

RE: Does Spark SQL support indexes?

2016-08-15 Thread Ashic Mahtab
Guess the good people in the Cassandra world are stuck in the past making indexes, materialized views, etc. better with every release :) From: mich.talebza...@gmail.com Date: Mon, 15 Aug 2016 11:11:03 +0100 Subject: Re: Does Spark SQL support indexes? To: gourav.sengu...@gmail.com CC:

RE: Spark join and large temp files

2016-08-12 Thread Ashic Mahtab
.count() to force a shuffle, it'll push the records that will be joined to the same executors. So;a = spark.read.parquet(‘path_to_table_a’).repartition(‘id’).cache()a.count() b = spark.read.parquet(‘path_to_table_b').repartition(‘id’).cache()b.count() And then join.. On Aug 8, 2016, at 8:1

RE: Spark join and large temp files

2016-08-10 Thread Ashic Mahtab
emory on the driver, increase your memory. Speaking of which, a filtering step might also help on the above, i.e., filter the bigRDD with the keys of the Map before joining. Hope this helps,Anastasios On Tue, Aug 9, 2016 at 4:46 PM, Ashic Mahtab <as...@live.com> wrote: Hi Sam,Yup

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab [mailto:as...@live.com] Sent: Monday, July 04, 2016 15.06 To: Apache Spark Subject: RE: Cluster mode deployment from jar in S3 Sorry to do this...but... *bump* From: as...@live.com To: user@spark.apache.org

Spark join and large temp files

2016-08-08 Thread Ashic Mahtab
Hello, We have two parquet inputs of the following form: a: id:String, Name:String (1.5TB); b: id:String, Number:Int (1.3GB). We need to join these two to get (id, Number, Name). We've tried two approaches: a.join(b, Seq("id"), "right_outer") where a and b are dataframes. We also tried taking the

RE: Spark join and large temp files

2016-08-08 Thread Ashic Mahtab
g 2016 00:01:32 +0530 Subject: Re: Spark join and large temp files To: as...@live.com CC: user@spark.apache.org Register you dataframes as temp tables and then try the join on the temp table.This should resolve your issue. ThanksDeepak On Mon, Aug 8, 2016 at 11:47 PM, Ashic Mahtab <as...@

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
like what helped in this scenario. Thanks Deepak On Tue, Aug 9, 2016 at 12:23 AM, Ashic Mahtab <as...@live.com> wrote: Hi Deepak, Thanks for the response. Registering the temp tables didn't help. Here's what I have: val a = sqlContext.read.parquet(...).select("eid.

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
; user@spark.apache.org Have you tried to broadcast your small table in order to perform your join? joined = bigDF.join(broadcast(smallDF, ) On Tue, Aug 9, 2016 at 3:29 PM, Ashic Mahtab <as...@live.com> wrote: Hi Deepak, No...not really. Upping the disk size is a solution, bu
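
Spelled out, the broadcast suggestion looks roughly like this (assuming a SQLContext and the two parquet inputs from this thread; the hint only helps when the small side genuinely fits in memory):

    import org.apache.spark.sql.functions.broadcast

    val bigDF   = sqlContext.read.parquet("path_to_table_a") // id, Name (1.5TB)
    val smallDF = sqlContext.read.parquet("path_to_table_b") // id, Number (1.3GB)

    // broadcast() ships the small side to every executor, so the large side is never shuffled.
    val joined = bigDF.join(broadcast(smallDF), Seq("id"), "right_outer")
    joined.write.parquet("path_to_output") // illustrative output path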

RE: Spark join and large temp files

2016-08-09 Thread Ashic Mahtab
the results. It should not take more than 40 mins in a 32 GB RAM system with 6 core processors. Gourav On Tue, Aug 9, 2016 at 6:02 PM, Ashic Mahtab <as...@live.com> wrote: Hi Mich, Hardware: AWS EMR cluster with 15 nodes of r3.2xlarge instances (CPU, RAM fine, disk a couple of hundred gig). W

Cluster mode deployment from jar in S3

2016-07-01 Thread Ashic Mahtab
Hello, I've got a Spark stand-alone cluster using EC2 instances. I can submit jobs using "--deploy-mode client", however using "--deploy-mode cluster" is proving to be a challenge. I've tried this: spark-submit --class foo --master spark://master-ip:7077 --deploy-mode cluster

RE: Cluster mode deployment from jar in S3

2016-07-04 Thread Ashic Mahtab
access key aid and secret access key when you initially configured it. Is your s3 bucket without any access restrictions? Best regards / Mit freundlichen Grüßen / Sincères salutations M. Lohith Samaga From: Ashic Mahtab [mailto:as...@live.com] Sent: Monday, July 04
