SparkSQL Dataframe : partitionColumn, lowerBound, upperBound, numPartitions in context of reading from MySQL

2016-03-30 Thread Soumya Simanta
I'm trying to understand what the following configurations mean and their implication on reading data from a MySQL table. I'm looking for options that will impact my read throughput when reading data from a large table. Thanks. partitionColumn, lowerBound, upperBound, numPartitions These
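For context, a minimal sketch of how these options are typically passed through the JDBC data source (assuming the spark-shell's sqlContext; the URL, table name, and column names below are placeholders, not from the original thread):

    // Sketch: "people" and its numeric "id" column are assumed for illustration
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://dbhost:3306/mydb")
      .option("dbtable", "people")
      .option("partitionColumn", "id")    // numeric column used to split the table
      .option("lowerBound", "1")          // together with upperBound, defines the stride
      .option("upperBound", "1000000")    //   of each partition's WHERE clause
      .option("numPartitions", "8")       // number of partitions = parallel JDBC reads
      .load()

Rows outside [lowerBound, upperBound] are still read; the bounds only control how the column's range is divided across partitions, so numPartitions is usually the main throughput knob.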

Re: Yarn client mode: Setting environment variables

2016-02-17 Thread Soumya Simanta
Can you give some examples of what variables you are trying to set? On Thu, Feb 18, 2016 at 1:01 AM, Lin Zhao wrote: > I've been trying to set some environment variables for the spark executors > but haven't had much luck. I tried editing conf/spark-env.sh but it > doesn't
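A hedged sketch of the configuration-based route (MY_VAR, the class, and the jar are placeholders): spark.executorEnv.* sets variables on the executors, and spark.yarn.appMasterEnv.* covers the YARN application master.

    # Sketch: set an environment variable without editing conf/spark-env.sh
    ./bin/spark-submit \
      --master yarn --deploy-mode client \
      --conf spark.executorEnv.MY_VAR=some_value \
      --conf spark.yarn.appMasterEnv.MY_VAR=some_value \
      --class com.example.MyApp myapp.jar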

Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
, Jan 29, 2015 at 11:35 AM, Soumya Simanta soumya.sima...@gmail.com wrote: On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Does the error change on build with and without the built options? What do you mean by build options? I'm just doing ./sbt/sbt

Re: Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda ar...@sigmoidanalytics.com wrote: Does the error change on build with and without the built options? What do you mean by build options? I'm just doing ./sbt/sbt assembly from $SPARK_HOME Did you try using maven? and doing the proxy settings

Building Spark behind a proxy

2015-01-29 Thread Soumya Simanta
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using ./sbt/sbt assembly and I get the following error. I've set the http and https proxy as well as the JAVA_OPTS. Any idea what am I missing ? [warn] one warning found org.apache.maven.model.building.ModelBuildingException: 1 problem
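For reference, a sketch of the proxy settings that typically need to reach the sbt JVM (proxy.example.com:8080 is a placeholder for your proxy):

    export JAVA_OPTS="-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 \
      -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080"
    export SBT_OPTS="$JAVA_OPTS"
    ./sbt/sbt assembly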

Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Soumya Simanta
I'm deploying Spark using the Click to Deploy Hadoop - Install Apache Spark on Google Compute Engine. I can run Spark jobs on the REPL and read data from Google storage. However, I'm not sure how to access the Spark UI in this deployment. Can anyone help? Also, it deploys Spark 1.1. Is there an

Re: Sharing sqlContext between Akka router and routee actors ...

2014-12-18 Thread Soumya Simanta
Why do you need a router? I mean, can't you do this with just one actor that has the SQLContext inside it? On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel manojsamelt...@gmail.com wrote: Hi, Akka router creates a sqlContext and creates a bunch of routee actors with sqlContext as parameter. The

Trying to understand a basic difference between these two configurations

2014-12-05 Thread Soumya Simanta
I'm trying to understand the conceptual difference between these two configurations in terms of performance (using a Spark standalone cluster). Case 1: 1 Node 60 cores 240G of memory 50G of data on local file system Case 2: 6 Nodes 10 cores per node 40G of memory per node 50G of data on HDFS nodes

Re: Spark or MR, Scala or Java?

2014-11-22 Thread Soumya Simanta
Thanks Sean. Adding user@spark.apache.org again. On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote: On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta soumya.sima...@gmail.com wrote: Is the MapReduce API simpler or the implementation? Almost every Spark presentation has

Re: MongoDB Bulk Inserts

2014-11-21 Thread Soumya Simanta
(inputFile) .map(parser.parse) .mapPartitions(bulkLoad) But the Iterator[T] of mapPartitions is always empty, even though I know map is generating records. On Thu Nov 20 2014 at 9:25:54 PM Soumya Simanta soumya.sima...@gmail.com wrote: On Thu, Nov 20, 2014

Re: MongoDB Bulk Inserts

2014-11-20 Thread Soumya Simanta
On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com wrote: I'm trying to use MongoDB as a destination for an ETL I'm writing in Spark. It appears I'm gaining a lot of overhead in my system databases (and possibly in the primary documents themselves); I can only assume

Parsing a large XML file using Spark

2014-11-18 Thread Soumya Simanta
If there is one big XML file (e.g., the Wikipedia dump, 44GB, or the larger dump that also has all revision information) that is stored in HDFS, is it possible to parse it in parallel/faster using Spark? Or do we have to use something like a PullParser or Iteratee? My current solution is to read the single

SparkSQL performance

2014-10-31 Thread Soumya Simanta
I was really surprised to see the results here, esp. SparkSQL not completing http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style I was under the impression that SparkSQL performs really well because it can optimize the RDD operations and load only the columns that are required.

Re: SparkSQL performance

2014-10-31 Thread Soumya Simanta
tuning tricks of all products. However, realistically, there is a big gap in terms of documentation. Hope the Spark folks will make a difference. :-) Du From: Soumya Simanta soumya.sima...@gmail.com Date: Friday, October 31, 2014 at 4:04 PM To: user@spark.apache.org user@spark.apache.org

Re: sbt/sbt compile error [FATAL]

2014-10-29 Thread Soumya Simanta
Are you trying to compile the master branch? Can you try branch-1.1? On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS hanspeter.sl...@gmail.com wrote: Hi, I have cloned spark as: git clone g...@github.com:apache/spark.git cd spark sbt/sbt compile Apparently

Re: install sbt

2014-10-28 Thread Soumya Simanta
sbt is just a jar file. So you really don't need to install anything. Once you run the jar file (sbt-launch.jar) it can download the required dependencies. I use an executable script called sbt that has the following contents. SBT_OPTS=-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled
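A minimal version of such a wrapper, reconstructed as a sketch (the exact JVM flags and the jar location are assumptions):

    #!/bin/bash
    # Assumes sbt-launch.jar sits next to this script
    SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled"
    java $SBT_OPTS -jar "$(dirname "$0")/sbt-launch.jar" "$@"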

Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
Try reducing the resources (cores and memory) of each application. On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote: Hi, How do I run multiple spark applications in parallel? I tried to run on yarn cluster, though the second application submitted does not run. Thanks,
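A hedged example of what "reducing resources" looks like on YARN, so the cluster has headroom to schedule a second application (the class and jar names are placeholders):

    ./bin/spark-submit --master yarn-cluster \
      --num-executors 2 --executor-cores 1 --executor-memory 1g \
      --driver-memory 1g \
      --class com.example.StreamingApp app.jar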

Re: run multiple spark applications in parallel

2014-10-28 Thread Soumya Simanta
--driver-memory 1g --executor-memory 1g --executor-cores 1 UBER.JAR ${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1 The box has 24 CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz 32 GB RAM Thanks, Josh On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com wrote

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-27 Thread Soumya Simanta
You need to change the Scala compiler from IntelliJ to “sbt incremental compiler” (see the screenshot below). You can access this by going to “preferences” → “scala”. NOTE: This is supported only for certain versions of the IntelliJ Scala plugin. See this link for details.

Re: Spark as Relational Database

2014-10-26 Thread Soumya Simanta
keep everything in Spark. On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta soumya.sima...@gmail.com wrote: 1. What data store do you want to store your data in? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option

Re: Spark as Relational Database

2014-10-25 Thread Soumya Simanta
1. What data store do you want to store your data in? HDFS, HBase, Cassandra, S3 or something else? 2. Have you looked at SparkSQL (https://spark.apache.org/sql/)? One option is to process the data in Spark and then store it in the relational database of your choice. On Sat, Oct 25, 2014 at

Does start-slave.sh use the values in conf/slaves to launch a worker in Spark standalone cluster mode

2014-10-20 Thread Soumya Simanta
I'm working on a cluster where I need to start the workers separately and connect them to a master. I'm following the instructions here and using branch-1.1 http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually and I can start the master using ./sbin/start-master.sh
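Roughly, the manual path from the linked standalone docs (MASTER_HOST is a placeholder); as far as I can tell, conf/slaves is only consulted by the cluster-wide start-slaves.sh/start-all.sh scripts, not when launching a single worker by hand:

    # On the master node
    ./sbin/start-master.sh
    # On each worker node (sketch; replace MASTER_HOST with the master's address)
    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://MASTER_HOST:7077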

Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Soumya Simanta
I've a SchemaRDD that I want to convert to an RDD of Strings. How do I convert each Row inside the SchemaRDD to a String?
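One common approach, sketched against the Spark 1.1-era API and assuming the spark-shell's sc; the Person rows and the comma delimiter are placeholders/assumptions:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD                        // implicit RDD -> SchemaRDD
    case class Person(name: String, age: Int)
    val schemaRdd: org.apache.spark.sql.SchemaRDD =
      sc.parallelize(Seq(Person("a", 30), Person("b", 25)))
    // Each Row behaves like a sequence of column values, so mkString flattens it
    val strings = schemaRdd.map(row => row.mkString(","))
    strings.collect().foreach(println)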

Storing shuffle files on a Tachyon

2014-10-07 Thread Soumya Simanta
Is it possible to store Spark shuffle files on Tachyon?

Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the featurize method that Aaron used in one of his demos. I believe this feature will just work for detecting the character set (i.e., the language used). Can someone help? def featurize(s: String): Vector = { val n = 1000 val result = new
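For context, a hedged reconstruction of what such a featurizer typically does (not necessarily Aaron's exact code): hash character bigrams into a fixed 1000-dimensional frequency vector. Because it only looks at character pairs, it mostly captures which characters co-occur, i.e., the script/language, rather than the meaning of the text.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    def featurize(s: String): Vector = {
      val n = 1000
      val counts = new Array[Double](n)
      val bigrams = s.sliding(2).toArray
      for (b <- bigrams) {
        val idx = ((b.hashCode % n) + n) % n    // hash each bigram to a non-negative bucket
        counts(idx) += 1.0 / bigrams.length     // normalized bigram frequency
      }
      Vectors.dense(counts)
    }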

Setting serializer to KryoSerializer from command line for spark-shell

2014-09-20 Thread Soumya Simanta
Hi, I want to set the serializer for my spark-shell to Kryo, i.e., set spark.serializer to org.apache.spark.serializer.KryoSerializer. Can I do it without setting a new SparkConf? Thanks -Soumya
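A sketch of the command-line route, assuming a Spark version whose spark-shell forwards options to spark-submit:

    ./bin/spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer

    # or, equivalently, add a line to conf/spark-defaults.conf:
    # spark.serializer   org.apache.spark.serializer.KryoSerializer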

Re: Spark as a Library

2014-09-16 Thread Soumya Simanta
It depends on what you want to do with Spark. The following has worked for me. Let the container handle the HTTP request and then talk to Spark using another HTTP/REST interface. You can use the Spark Job Server for this. Embedding Spark inside the container is not a great long term solution IMO

Re: About SpakSQL OR MLlib

2014-09-15 Thread Soumya Simanta
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int, fuel:Int) 1. Create a PairedRDD of (age, Car) tuples (pairedRDD) 2. Create a new function fc //returns the interval lower and upper bound def fc(x:Int, interval:Int) : (Int,Int) = { val floor = x - (x%interval)
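A self-contained sketch of where this was heading, assuming the spark-shell's sc; the 5-year interval width and the average-emissions aggregation are assumptions for illustration:

    import java.util.Date
    import org.apache.spark.SparkContext._   // pair-RDD functions on older Spark versions

    case class Car(id: String, age: Int, tkm: Int, emissions: Int, date: Date, km: Int, fuel: Int)

    // Returns the lower and upper bound of the interval that x falls into
    def fc(x: Int, interval: Int): (Int, Int) = {
      val floor = x - (x % interval)
      (floor, floor + interval)
    }

    // Toy data; real input would come from a file
    val cars = sc.parallelize(Seq(
      Car("a", 3, 10000, 120, new Date(), 9000, 6),
      Car("b", 7, 50000, 150, new Date(), 48000, 7)))

    val avgEmissionsByAgeBucket = cars
      .map(c => (fc(c.age, 5), c.emissions))       // 1. key each car by its 5-year age interval
      .groupByKey()                                // 2. group cars in the same interval
      .mapValues(es => es.sum.toDouble / es.size)  // 3. aggregate per interval
    avgEmissionsByAgeBucket.collect().foreach(println)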

Re: Spark and Scala

2014-09-12 Thread Soumya Simanta
An RDD is a fault-tolerant distributed structure. It is the primary abstraction in Spark. I would strongly suggest that you have a look at the following to get a basic idea. http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html

Re: Running Spark shell on YARN

2014-08-16 Thread Soumya Simanta
$Parser.parse(URI.java:3038) at java.net.URI.init(URI.java:753) at org.apache.hadoop.fs.Path.initialize(Path.java:203) ... 62 more Spark context available as sc. On Fri, Aug 15, 2014 at 3:49 PM, Soumya Simanta soumya.sima...@gmail.com wrote: After changing the allocation I'm getting

Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
I've been using the standalone cluster all this time and it worked fine. Recently I'm using another Spark cluster that is based on YARN and I have no experience with YARN. The YARN cluster has 10 nodes and a total memory of 480G. I'm having trouble starting the spark-shell with enough memory. I'm
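For reference, the kind of invocation involved here, sketched: each executor request (plus memory overhead) has to fit under YARN's yarn.scheduler.maximum-allocation-mb, so it is usually easier to ask for more, smaller executors than for one huge one (the numbers below are placeholders):

    ./bin/spark-shell --master yarn-client \
      --num-executors 10 --executor-cores 4 --executor-memory 20g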

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Andrew, Thanks for your response. When I try to do the following. ./spark-shell --executor-memory 46g --master yarn I get the following error. Exception in thread main java.lang.Exception: When running with master 'yarn

Re: Running Spark shell on YARN

2014-08-15 Thread Soumya Simanta
, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote: We generally recommend setting yarn.scheduler.maximum-allocation-mb to the maximum node capacity. -Sandy On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote: I just checked the YARN config and looks

Script to deploy spark to Google compute engine

2014-08-13 Thread Soumya Simanta
Before I start doing something on my own I wanted to check if someone has created a script to deploy the latest version of Spark to Google Compute Engine. Thanks -Soumya

Re: Transform RDD[List]

2014-08-11 Thread Soumya Simanta
Try something like this. scala val a = sc.parallelize(List(1,2,3,4,5)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at console:12 scala val b = sc.parallelize(List(6,7,8,9,10)) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
Check your executor logs for the output, or, if your data is not big, collect it in the driver and print it. On Jul 16, 2014, at 9:21 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi All, I'm trying to do a simple record matching between 2 files and wrote following

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Check your executor logs for the output, or, if your data is not big, collect it in the driver and print it. On Jul 16, 2014, at 9:21 AM, Sarath Chandra sarathchandra.jos...@algofusiontech.com wrote: Hi All

Re: Simple record matching using Spark SQL

2014-07-16 Thread Soumya Simanta
. Barring the statements to create the spark context, if I copy paste the lines of my code in spark shell, runs perfectly giving the desired output. ~Sarath On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta soumya.sima...@gmail.com wrote: When you submit your job, it should appear

Re: Client application that calls Spark and receives an MLlib *model* Scala Object, not just result

2014-07-14 Thread Soumya Simanta
Please look at the following. https://github.com/ooyala/spark-jobserver http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language https://github.com/EsotericSoftware/kryo You can train your model, convert it to PMML, and return that to your client, OR you can train your model and write that

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
If you are on the 1.0.0 release you can also try converting your RDD to a SchemaRDD and running a groupBy there. The SparkSQL optimizer may yield better results. It's worth a try at least. On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Solution 2 is to map
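A hedged sketch of that suggestion against the Spark 1.0.x API, assuming the spark-shell's sc; the Event case class, field names, and table name are made up for illustration (registerAsTable was the 1.0.x name, later renamed registerTempTable):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD              // implicit RDD -> SchemaRDD
    case class Event(day: String, value: Int)
    val events = sc.parallelize(Seq(Event("2014-07-10", 1), Event("2014-07-11", 2)))
    events.registerAsTable("events")
    val perDay = sqlContext.sql("SELECT day, SUM(value) FROM events GROUP BY day")
    perDay.collect().foreach(println)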

Re: How to separate a subset of an RDD by day?

2014-07-11 Thread Soumya Simanta
I think my best option is to partition my data in directories by day before running my Spark application, and then direct my Spark application to load RDD's from each directory when I want to load a date range. How does this sound? If your upstream system can write data by day then it makes

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Do you have a proxy server? If yes, you need to set the proxy for twitter4j. On Jul 11, 2014, at 7:06 PM, SK skrishna...@gmail.com wrote: I don't get any exceptions or error messages. I tried it both with and without VPN and had the same outcome. But I can try again without VPN later

Re: Streaming training@ Spark Summit 2014

2014-07-11 Thread Soumya Simanta
Try running a simple standalone program if you are using Scala and see if you are getting any data. I use this to debug any connection/twitter4j issues. import twitter4j._ //put your keys and creds here object Util { val config = new twitter4j.conf.ConfigurationBuilder()
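A hedged sketch of such a twitter4j-only connectivity check (the OAuth keys/secrets are placeholders); if tweets print here, the credentials and network path are fine and the problem is elsewhere:

    import twitter4j._
    import twitter4j.conf.ConfigurationBuilder

    object Util {
      // Put your keys and creds here (placeholders below)
      val config = new ConfigurationBuilder()
        .setOAuthConsumerKey("YOUR_CONSUMER_KEY")
        .setOAuthConsumerSecret("YOUR_CONSUMER_SECRET")
        .setOAuthAccessToken("YOUR_ACCESS_TOKEN")
        .setOAuthAccessTokenSecret("YOUR_ACCESS_TOKEN_SECRET")
        .build()
    }

    object TwitterCheck extends App {
      val stream = new TwitterStreamFactory(Util.config).getInstance()
      stream.addListener(new StatusAdapter {
        override def onStatus(status: Status): Unit = println(status.getText)
      })
      stream.sample()          // start the public sample stream
      Thread.sleep(60000)      // let it run for a minute
      stream.shutdown()
    }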

Re: Comparative study

2014-07-07 Thread Soumya Simanta
Daniel, Do you mind sharing the size of your cluster and the production data volumes ? Thanks Soumya On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained;

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
Are these sessions recorded ? On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote: *General Session / Keynotes : http://www.ustream.tv/channel/spark-summit-2014 http://www.ustream.tv/channel/spark-summit-2014 Track A : http://www.ustream.tv/channel/track-a1

Re: Spark Summit 2014 Day 2 Video Streams?

2014-07-01 Thread Soumya Simanta
, 2014, at 7:47 PM, Marco Shaw marco.s...@gmail.com wrote: They are recorded... For example, 2013: http://spark-summit.org/2013 I'm assuming the 2014 videos will be up in 1-2 weeks. Marco On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com wrote

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate to Spark. On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi to all, in my use case I'd like to receive events and call an external

Re: Spark streaming and rate limit

2014-06-18 Thread Soumya Simanta
, 2014 at 12:24 AM, Soumya Simanta soumya.sima...@gmail.com wrote: You can add a back-pressure-enabled component in front that feeds data into Spark. This component can control the input rate to Spark. On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote: Hi

Re: Unable to run a Standalone job

2014-05-22 Thread Soumya Simanta
Try cleaning your maven (.m2) and ivy cache. On May 23, 2014, at 12:03 AM, Shrikar archak shrika...@gmail.com wrote: Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1. Thanks, Shrikar On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com wrote:

Re: Run Apache Spark on Mini Cluster

2014-05-21 Thread Soumya Simanta
Suggestion - try to get an idea of your hardware requirements by running a sample on Amazon's EC2 or Google compute engine. It's relatively easy (and cheap) to get started on the cloud before you invest in your own hardware IMO. On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar

Re: Historical Data as Stream

2014-05-17 Thread Soumya Simanta
@Laeeq - please see this example. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49 On Sat, May 17, 2014 at 2:06 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: @Soumya Simanta Right now its just a prove
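The core idea of the linked example, sketched as a standalone program: textFileStream on a directory turns files dropped into HDFS into a stream, so historical data can be replayed by copying files in (the directory path is a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream ops on older versions

    // Files copied into the directory after the job starts are picked up as a stream
    val conf = new SparkConf().setAppName("HistoricalDataAsStream")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("hdfs:///data/incoming")
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc.start()
    ssc.awaitTermination()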

Re: Proper way to create standalone app with custom Spark version

2014-05-16 Thread Soumya Simanta
Install your custom spark jar to your local maven or ivy repo. Use this custom jar in your pom/sbt file. On May 15, 2014, at 3:28 AM, Andrei faithlessfri...@gmail.com wrote: (Sorry if you have already seen this message - it seems like there were some issues delivering messages to the
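A sketch of both routes (the artifact version string is a placeholder for whatever your custom build uses):

    # sbt route: publishes the locally built Spark artifacts to ~/.ivy2/local
    sbt/sbt publish-local

    # maven route: install a single custom jar into ~/.m2
    mvn install:install-file -DgroupId=org.apache.spark \
      -DartifactId=spark-core_2.10 -Dversion=1.0.0-custom \
      -Dfile=spark-core_2.10-1.0.0-custom.jar -Dpackaging=jar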

Re: Historical Data as Stream

2014-05-16 Thread Soumya Simanta
A file is just a stream with a fixed length. Usually streams don't end, but in this case it would. On the other hand, if you read your file as a stream you may not be able to use the entire data in the file for your analysis. Spark (given enough memory) can process large amounts of data quickly. On

Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
I've a Spark cluster with 3 worker nodes. - *Workers:* 3 - *Cores:* 48 Total, 48 Used - *Memory:* 469.8 GB Total, 72.0 GB Used I want to process a single compressed file (*.gz) on HDFS. The file is 1.5GB compressed and 11GB uncompressed. When I try to read the compressed file from HDFS
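One relevant detail: gzip is not a splittable codec, so the whole file lands in a single partition and is read by one task. A common workaround, sketched for the spark-shell (path and partition count are placeholders):

    // The .gz file is decompressed by a single task; repartition right after reading
    // so the remaining work can use all 48 cores
    val raw = sc.textFile("hdfs:///data/big-file.gz")
    val spread = raw.repartition(48)
    spread.cache()
    println(spread.count())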

Re: Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-11 Thread Soumya Simanta
into multiple blocks to be read and for subsequent processing. On 11 May 2014 09:01, Soumya Simanta soumya.sima...@gmail.com wrote: I've a Spark cluster with 3 worker nodes. - *Workers:* 3 - *Cores:* 48 Total, 48 Used - *Memory:* 469.8 GB Total, 72.0 GB Used I want a process

Re: How to use spark-submit

2014-05-11 Thread Soumya Simanta
? Also, are you on the latest commit of branch-1.0 ? TD On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate

Caused by: java.lang.OutOfMemoryError: unable to create new native thread

2014-05-05 Thread Soumya Simanta
I just upgraded my Spark version to 1.0.0_SNAPSHOT. commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3 Author: witgo wi...@qq.com Date: Fri May 2 12:40:27 2014 -0700 I'm running a standalone cluster with 3 workers. - *Workers:* 3 - *Cores:* 48 Total, 0 Used - *Memory:* 469.8 GB

Problem with sharing class across worker nodes using spark-shell on Spark 1.0.0

2014-05-05 Thread Soumya Simanta
Hi, I'm trying to run a simple Spark job that uses a 3rd party class (in this case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT I'm starting my bin/spark-shell with the following command. ./spark-shell
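The usual fix for third-party classes missing on the workers is to ship the jar along with the shell (the jar path and version below are placeholders):

    # Sketch: make twitter4j visible to the executors as well as the driver
    ./bin/spark-shell --jars /path/to/twitter4j-core-3.0.3.jar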

Re: How to use spark-submit

2014-05-05 Thread Soumya Simanta
Yes, I'm struggling with a similar problem where my classes are not found on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if someone could provide some documentation on the usage of spark-submit. Thanks On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com

Re: Announcing Spark SQL

2014-03-26 Thread Soumya Simanta
Very nice. Any plans to make the SQL typesafe using something like Slick ( http://slick.typesafe.com/)? Thanks ! On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote: Hey Everyone, This already went out to the dev list, but I wanted to put a pointer here as well to

Help with building and running examples with GraphX from the REPL

2014-02-25 Thread Soumya Simanta
I'm not able to run the GraphX examples from the Scala REPL. Can anyone point to the correct documentation that talks about the configuration and/or how to build GraphX for the REPL ? Thanks