I'm trying to understand what the following configurations mean and their
implication on reading data from a MySQL table. I'm looking for options
that will impact my read throughput when reading data from a large table.
Thanks.
partitionColumn, lowerBound, upperBound, numPartitions These
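For reference, here is a sketch of how these four options drive a parallel JDBC read in the DataFrame API (Spark 1.4+): Spark issues numPartitions queries, each covering one slice of [lowerBound, upperBound] on partitionColumn. The URL, table, and column names below are illustrative; sqlContext is the one the shell provides.

import java.util.Properties
val props = new Properties()
props.put("user", "dbuser")
props.put("password", "secret")
// 16 concurrent queries, each scanning one range of the numeric id column
val df = sqlContext.read.jdbc(
  "jdbc:mysql://dbhost:3306/mydb", // url
  "big_table",                     // table
  "id",                            // partitionColumn (must be numeric)
  1L,                              // lowerBound
  10000000L,                       // upperBound
  16,                              // numPartitions
  props)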
Can you give some examples of what variables you are trying to set?
On Thu, Feb 18, 2016 at 1:01 AM, Lin Zhao wrote:
> I've been trying to set some environment variables for the spark executors
> but haven't had much luck. I tried editing conf/spark-env.sh but it
> doesn't
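One alternative worth trying, sketched below: SparkConf can set environment variables on the executors directly, bypassing spark-env.sh. The variable name and value are illustrative.

import org.apache.spark.SparkConf
// setExecutorEnv propagates the variable to every executor JVM
val conf = new SparkConf().setExecutorEnv("MY_ENV_VAR", "some-value")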
On Thu, Jan 29, 2015 at 11:35 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
On Thu, Jan 29, 2015 at 11:05 AM, Arush Kharbanda
ar...@sigmoidanalytics.com wrote:
Does the error change on build with and without the built options?
What do you mean by build options? I'm just doing ./sbt/sbt assembly from
$SPARK_HOME
Did you try using Maven and configuring the proxy settings?
I'm trying to build Spark (v1.1.1 and v1.2.0) behind a proxy using
./sbt/sbt assembly and I get the following error. I've set the http and
https proxy as well as the JAVA_OPTS. Any idea what I'm missing?
[warn] one warning found
org.apache.maven.model.building.ModelBuildingException: 1 problem
I'm deploying Spark using the “Click to Deploy Hadoop - Install Apache
Spark” option on Google Compute Engine.
I can run Spark jobs on the REPL and read data from Google storage.
However, I'm not sure how to access the Spark UI in this deployment. Can
anyone help?
Also, it deploys Spark 1.1. Is there an
Why do you need a router? I mean, can't you do this with just one actor
which has the SQLContext inside it?
On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi,
An Akka router creates a sqlContext and creates a bunch of routee actors
with the sqlContext as a parameter. The
I'm trying to understand the conceptual difference between these two
configurations in terms of performance (using a Spark standalone cluster):
Case 1:
1 Node
60 cores
240G of memory
50G of data on local file system
Case 2:
6 Nodes
10 cores per node
40G of memory per node
50G of data on HDFS
nodes
Thanks Sean.
adding user@spark.apache.org again.
On Sat, Nov 22, 2014 at 9:35 PM, Sean Owen so...@cloudera.com wrote:
On Sun, Nov 23, 2014 at 2:20 AM, Soumya Simanta
soumya.sima...@gmail.com wrote:
Is the MapReduce API simpler or the implementation? Almost every Spark
presentation has
sc.textFile(inputFile)
.map(parser.parse)
.mapPartitions(bulkLoad)
But the Iterator[T] of mapPartitions is always empty, even though I know
map is generating records.
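The usual culprit, sketched below: mapPartitions keeps only what the function returns, so if bulkLoad drains the input iterator for its side effect and returns something empty, downstream stages see no records. Record and writeToStore are hypothetical names standing in for the real parser output and sink.

case class Record(line: String)                 // hypothetical record type
def writeToStore(batch: Seq[Record]): Unit = () // hypothetical bulk-load sink

def bulkLoad(records: Iterator[Record]): Iterator[Record] = {
  val batch = records.toList // the input iterator is single-pass; materialize it once
  writeToStore(batch)        // perform the side-effecting bulk load
  batch.iterator             // hand the same records back to the next stage
}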
On Thu Nov 20 2014 at 9:25:54 PM Soumya Simanta soumya.sima...@gmail.com
wrote:
On Thu, Nov 20, 2014
On Thu, Nov 20, 2014 at 10:18 PM, Benny Thompson ben.d.tho...@gmail.com
wrote:
I'm trying to use MongoDB as a destination for an ETL I'm writing in
Spark. It appears I'm gaining a lot of overhead in my system databases
(and possibly in the primary documents themselves); I can only assume
If there is one big XML file (e.g., the Wikipedia dump, 44GB, or the larger
dump that has all revision information also) that is stored in HDFS, is it possible
to parse it in parallel/faster using Spark? Or do we have to use something
like a PullParser or Iteratee?
My current solution is to read the single
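One splittable approach, sketched under the assumption that an XmlInputFormat is on the classpath (e.g., Mahout's org.apache.mahout.text.wikipedia.XmlInputFormat): it splits the file on a record tag, so partitions can be parsed in parallel. The path is illustrative.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.mahout.text.wikipedia.XmlInputFormat

val hadoopConf = new Configuration()
hadoopConf.set("xmlinput.start", "<page>") // record delimiter in a Wikipedia dump
hadoopConf.set("xmlinput.end", "</page>")
val pages = sc.newAPIHadoopFile(
    "hdfs:///wikipedia/dump.xml",
    classOf[XmlInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
  .map { case (_, text) => text.toString } // one <page>...</page> block per element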
I was really surprised to see the results here, esp. SparkSQL not
completing
http://www.citusdata.com/blog/86-making-postgresql-scale-hadoop-style
I was under the impression that SparkSQL performs really well because it
can optimize the RDD operations and load only the columns that are
required.
tuning tricks of all products. However, realistically, there is a
big gap in terms of documentation. Hope the Spark folks will make a
difference. :-)
Du
From: Soumya Simanta soumya.sima...@gmail.com
Date: Friday, October 31, 2014 at 4:04 PM
To: user@spark.apache.org user@spark.apache.org
Are you trying to compile the master branch? Can you try branch-1.1?
On Wed, Oct 29, 2014 at 6:55 AM, HansPeterS hanspeter.sl...@gmail.com
wrote:
Hi,
I have cloned sparked as:
git clone git@github.com:apache/spark.git
cd spark
sbt/sbt compile
Apparently
sbt is just a jar file. So you really don't need to install anything. Once
you run the jar file (sbt-launch.jar) it can download the required
dependencies.
I use an executable script called sbt that has the following contents.
SBT_OPTS="-Xms1024M -Xmx2048M -Xss1M -XX:+CMSClassUnloadingEnabled"
java $SBT_OPTS -jar `dirname $0`/sbt-launch.jar "$@"
Try reducing the resources (cores and memory) of each application.
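A sketch with illustrative values; the idea is to cap what each application asks for so that several fit on the cluster at once.

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "1g")
  .set("spark.executor.cores", "1")
  .set("spark.executor.instances", "2") // on YARN, also bound the executor count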
On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote:
Hi,
How do I run multiple spark applications in parallel? I tried to run on a
YARN cluster, but the second application submitted does not run.
Thanks,
--driver-memory
1g --executor-memory 1g --executor-cores 1 UBER.JAR
${ZK_PORT_2181_TCP_ADDR} my-consumer-group1 1
The box has
24 CPUs, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
32 GB RAM
Thanks,
Josh
On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com
wrote
You need to change the Scala compiler from IntelliJ to “sbt incremental
compiler” (see the screenshot below).
You can access this by going to “preferences” → “scala”.
NOTE: This is supported only for certain versions of the IntelliJ Scala plugin.
See this link for details.
keep
everything in Spark.
On Sat, Oct 25, 2014 at 11:34 PM, Soumya Simanta
soumya.sima...@gmail.com wrote:
1. What data store do you want to store your data in? HDFS, HBase,
Cassandra, S3 or something else?
2. Have you looked at SparkSQL (https://spark.apache.org/sql/)?
One option is to process the data in Spark and then store it in the
relational database of your choice.
On Sat, Oct 25, 2014 at
I'm working on a cluster where I need to start the workers separately and
connect them to a master.
I'm following the instructions here and using branch-1.1
http://spark.apache.org/docs/latest/spark-standalone.html#starting-a-cluster-manually
and I can start the master using
./sbin/start-master.sh
I've a SchemaRDD that I want to convert to an RDD of Strings. How
do I convert the Rows inside the SchemaRDD to Strings?
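A minimal sketch: a SchemaRDD is an RDD[Row], and in the 1.x API a Row behaves like a Seq of column values, so a plain map is enough (schemaRDD is the assumed input).

val strings: org.apache.spark.rdd.RDD[String] =
  schemaRDD.map(row => row.mkString(","))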
Is it possible to store spark shuffle files on Tachyon?
I'm trying to understand the intuition behind the features method that
Aaron used in one of his demos. I believe this feature will just work for
detecting the character set (i.e., language used).
Can someone help?
def featurize(s: String): Vector = {
  val n = 1000
  val result = new Array[Double](n) // sketch completion: hash character bigrams into n buckets
  s.sliding(2).foreach(bg => result(math.abs(bg.hashCode) % n) += 1.0)
  Vectors.dense(result) // Vector/Vectors from org.apache.spark.mllib.linalg
}
Hi,
I want to set the serializer for my spark-shell to Kryo, i.e., set
spark.serializer to org.apache.spark.serializer.KryoSerializer.
Can I do it without setting a new SparkConf?
Thanks
-Soumya
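For contrast, a sketch of the SparkConf route in an application; to avoid building a SparkConf yourself, the property can usually go in conf/spark-defaults.conf (spark.serializer org.apache.spark.serializer.KryoSerializer) so spark-shell picks it up at startup.

import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("kryo-app") // illustrative
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)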
It depends on what you want to do with Spark. The following has worked for
me.
Let the container handle the HTTP request and then talk to Spark using
another HTTP/REST interface. You can use the Spark Job Server for this.
Embedding Spark inside the container is not a great long-term solution, IMO.
case class Car(id:String,age:Int,tkm:Int,emissions:Int,date:Date, km:Int,
fuel:Int)
1. Create a PairedRDD of (age, Car) tuples (pairedRDD)
2. Create a new function fc
//returns the interval lower and upper bound
def fc(x: Int, interval: Int): (Int, Int) = {
  val floor = x - (x % interval)
  (floor, floor + interval)
}
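A sketch of how fc could then be used to bucket the cars by age interval (cars is an assumed RDD[Car]; the interval of 5 is illustrative):

val pairedRDD = cars.map(c => (c.age, c))       // step 1: (age, Car) tuples
val byBucket = pairedRDD
  .map { case (age, car) => (fc(age, 5), car) } // key by (lower, upper) interval
  .groupByKey()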
An RDD is a fault-tolerant, distributed data structure. It is the primary
abstraction in Spark.
I would strongly suggest that you have a look at the following to get a
basic idea.
http://www.cs.berkeley.edu/~pwendell/strataconf/api/core/spark/RDD.html
$Parser.parse(URI.java:3038)
at java.net.URI.<init>(URI.java:753)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
... 62 more
Spark context available as sc.
On Fri, Aug 15, 2014 at 3:49 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
After changing the allocation I'm getting
I've been using the standalone cluster all this time and it worked fine.
Recently I've been using another Spark cluster that is based on YARN, and I
have no experience with YARN.
The YARN cluster has 10 nodes and a total memory of 480G.
I'm having trouble starting the spark-shell with enough memory.
I'm
On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
Andrew,
Thanks for your response.
When I try to do the following.
./spark-shell --executor-memory 46g --master yarn
I get the following error.
Exception in thread "main" java.lang.Exception: When running with master
'yarn
On Fri, Aug 15, 2014 at 2:47 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
We generally recommend setting yarn.scheduler.maximum-allocation-mb to the
maximum node capacity.
-Sandy
On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com
wrote:
I just checked the YARN config and looks
Before I start doing something on my own I wanted to check if someone has
created a script to deploy the latest version of Spark to Google Compute
Engine.
Thanks
-Soumya
Try something like this.
scala> val a = sc.parallelize(List(1,2,3,4,5))
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize
at <console>:12
scala> val b = sc.parallelize(List(6,7,8,9,10))
b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize
at
Check your executor logs for the output or, if your data is not big,
collect it in the driver and print it.
On Jul 16, 2014, at 9:21 AM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Hi All,
I'm trying to do a simple record matching between 2 files and wrote following
On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
Barring the statements to create the spark context, if I copy-paste the lines
of my code into the spark shell, it runs perfectly, giving the desired output.
~Sarath
On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
When you submit your job, it should appear
Please look at the following.
https://github.com/ooyala/spark-jobserver
http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
https://github.com/EsotericSoftware/kryo
You can train your model, convert it to PMML, and return that to your client
OR
You can train your model and write that
If you are on the 1.0.0 release you can also try converting your RDD to a
SchemaRDD and run a groupBy there. The SparkSQL optimizer may yield
better results. It's worth a try at least.
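A sketch of that suggestion in the 1.0 API; Record and records are assumed names, and the data here is illustrative.

import org.apache.spark.sql.SQLContext
case class Record(key: String, value: Int)
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD // implicit RDD-of-case-class -> SchemaRDD
val records = sc.parallelize(Seq(Record("a", 1), Record("b", 2)))
records.registerAsTable("records")
val grouped = sqlContext.sql("SELECT key, SUM(value) FROM records GROUP BY key")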
On Fri, Jul 11, 2014 at 5:24 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
Solution 2 is to map
I think my best option is to partition my data in directories by day
before running my Spark application, and then direct
my Spark application to load RDDs from each directory when
I want to load a date range. How does this sound?
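For what it's worth, a sketch of loading a date range under that layout (the base path and partition naming are illustrative); textFile accepts a comma-separated list of paths:

val days = Seq("2014-07-09", "2014-07-10", "2014-07-11")
val paths = days.map(d => s"hdfs:///events/day=$d/*").mkString(",")
val rdd = sc.textFile(paths) // one RDD spanning all requested days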
If your upstream system can write data by day then it makes
Do you have a proxy server?
If yes, you need to set the proxy for twitter4j.
On Jul 11, 2014, at 7:06 PM, SK skrishna...@gmail.com wrote:
I don't get any exceptions or error messages.
I tried it both with and without VPN and had the same outcome. But I can
try again without VPN later.
Try running a simple standalone program if you are using Scala and see if
you are getting any data. I use this to debug any connection/twitter4j
issues.
import twitter4j._

// put your keys and creds here
object Util {
  val config = new twitter4j.conf.ConfigurationBuilder()
    .setOAuthConsumerKey("KEY").setOAuthConsumerSecret("SECRET")
    .setOAuthAccessToken("TOKEN").setOAuthAccessTokenSecret("TOKEN_SECRET")
    .build()
}
Daniel,
Do you mind sharing the size of your cluster and the production data volumes?
Thanks
Soumya
On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained;
Are these sessions recorded?
On Tue, Jul 1, 2014 at 9:47 AM, Alexis Roos alexis.r...@gmail.com wrote:
*General Session / Keynotes :
http://www.ustream.tv/channel/spark-summit-2014
http://www.ustream.tv/channel/spark-summit-2014 Track A
: http://www.ustream.tv/channel/track-a1
On Jul 1, 2014, at 7:47 PM, Marco Shaw marco.s...@gmail.com wrote:
They are recorded... For example, 2013: http://spark-summit.org/2013
I'm assuming the 2014 videos will be up in 1-2 weeks.
Marco
On Tue, Jul 1, 2014 at 3:18 PM, Soumya Simanta soumya.sima...@gmail.com
wrote
You can add a back-pressure-enabled component in front that feeds data into
Spark. This component can control the input rate to Spark.
On Jun 18, 2014, at 6:13 PM, Flavio Pompermaier pomperma...@okkam.it wrote:
Hi to all,
in my use case I'd like to receive events and call an external
Try cleaning your Maven (.m2) and Ivy caches.
On May 23, 2014, at 12:03 AM, Shrikar archak shrika...@gmail.com wrote:
Yes I did a sbt publish-local. Ok I will try with Spark 0.9.1.
Thanks,
Shrikar
On Thu, May 22, 2014 at 8:53 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Suggestion - try to get an idea of your hardware requirements by running a
sample on Amazon's EC2 or Google compute engine. It's relatively easy (and
cheap) to get started on the cloud before you invest in your own hardware
IMO.
On Wed, May 21, 2014 at 8:14 PM, Upender Nimbekar
@Laeeq - please see this example.
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala#L47-L49
On Sat, May 17, 2014 at 2:06 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
@Soumya Simanta
Right now it's just a proof
Install your custom spark jar to your local maven or ivy repo. Use this custom
jar in your pom/sbt file.
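A sketch of the sbt side, assuming the custom Spark build was published locally (e.g., via sbt publish-local) under an illustrative version number:

// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-custom-SNAPSHOT"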
On May 15, 2014, at 3:28 AM, Andrei faithlessfri...@gmail.com wrote:
(Sorry if you have already seen this message - it seems like there were some
issues delivering messages to the
A file is just a stream with a fixed length. Usually streams don't end, but in
this case it would.
On the other hand, if you read your file as a stream you may not be able to use
the entire data in the file for your analysis. Spark (given enough memory) can
process large amounts of data quickly.
On
I've a Spark cluster with 3 worker nodes.
- *Workers:* 3
- *Cores:* 48 Total, 48 Used
- *Memory:* 469.8 GB Total, 72.0 GB Used
I want to process a single compressed (*.gz) file on HDFS. The file is 1.5GB
compressed and 11GB uncompressed.
When I try to read the compressed file from HDFS
into multiple blocks to
be read and for subsequent processing.
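A sketch of that advice: a gzipped file is not splittable, so the read happens in a single task; repartitioning right after the read spreads the uncompressed data across the cluster for the subsequent processing (path and partition count are illustrative).

val raw = sc.textFile("hdfs:///data/input.gz") // one task reads the whole .gz
val distributed = raw.repartition(48)          // spread across the cluster's 48 cores
val lineCount = distributed.count()            // downstream stages now run with 48 tasks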
? Also, are you on the latest
commit of branch-1.0?
TD
On Mon, May 5, 2014 at 7:51 PM, Soumya Simanta soumya.sima...@gmail.com
wrote:
Yes, I'm struggling with a similar problem where my classes are not found
on the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate
I just upgraded my Spark version to 1.0.0_SNAPSHOT.
commit f25ebed9f4552bc2c88a96aef06729d9fc2ee5b3
Author: witgo wi...@qq.com
Date: Fri May 2 12:40:27 2014 -0700
I'm running a standalone cluster with 3 workers.
- *Workers:* 3
- *Cores:* 48 Total, 0 Used
- *Memory:* 469.8 GB
Hi,
I'm trying to run a simple Spark job that uses a 3rd party class (in this
case twitter4j.Status) in the spark-shell using spark-1.0.0_SNAPSHOT
I'm starting my bin/spark-shell with the following command.
./spark-shell
Yes, I'm struggling with a similar problem where my classes are not found on
the worker nodes. I'm using 1.0.0_SNAPSHOT. I would really appreciate it if
someone can provide some documentation on the usage of spark-submit.
Thanks
On May 5, 2014, at 10:24 PM, Stephen Boesch java...@gmail.com
Very nice.
Any plans to make the SQL typesafe using something like Slick
(http://slick.typesafe.com/)?
Thanks!
On Wed, Mar 26, 2014 at 5:58 PM, Michael Armbrust mich...@databricks.com wrote:
Hey Everyone,
This already went out to the dev list, but I wanted to put a pointer here
as well to
I'm not able to run the GraphX examples from the Scala REPL. Can anyone
point to the correct documentation that talks about the configuration
and/or how to build GraphX for the REPL?
Thanks