Run a Spark cluster managed by Apache Mesos. Mesos can run in high-availability
mode, in which multiple Mesos masters run simultaneously.
-
Software Developer
SigmoidAnalytics, Bangalore
Try any one of the following:
*1. Set the access key and secret key in the SparkContext:*
sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey)
sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)
*2. Set the access key and secret key in the environment before starting
your application:*
export
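(The export line above is truncated in the archive. A minimal sketch of both approaches; the fs.s3n.* property names are an assumption for the s3n filesystem and differ for s3a, and the key values are placeholders:)

// Option 1: set the keys on the SparkContext's Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// Option 2: export the keys in the shell before launching the application:
//   export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY
//   export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_KEY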
It's a jar conflict (the http-client
http://mvnrepository.com/artifact/org.apache.httpcomponents/httpclient/4.0-alpha4
jar). You could download the appropriate version of that jar and put it
in the classpath before your assembly jar, and hopefully it will avoid the
conflict.
Thanks
Best Regards
On
The scenario is using an HTable instance to scan multiple rowkey ranges in Spark
tasks; it looks like the code below:
Option 1:
val users = input
  .map { case (deviceId, uid) => uid }
  .distinct()
  .sortBy(x => x)
  .mapPartitions(iterator => {
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf,
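(The snippet above is cut off in the archive. A minimal sketch of the per-partition HTable pattern it appears to use; the table name and the Get-based lookups are assumptions, not the original code:)

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Get, HTable}
import org.apache.hadoop.hbase.util.Bytes

val users = input
  .map { case (deviceId, uid) => uid }
  .distinct()
  .sortBy(x => x)
  .mapPartitions { iterator =>
    // Create the HBase objects on the executor, so the non-serializable
    // Hadoop Configuration never has to be shipped from the driver.
    val conf = HBaseConfiguration.create()
    val table = new HTable(conf, "user_table")   // hypothetical table name
    val results = iterator.map { uid =>
      table.get(new Get(Bytes.toBytes(uid.toString)))
    }.toList                                     // materialize before closing
    table.close()
    results.iterator
  }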
Can you paste the complete code? It looks like at some point you are
passing a Hadoop Configuration, which is not Serializable. You can look at
this thread for a similar discussion:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-into-HBase-td13378.html
Thanks
Best Regards
On
Hi, Thanks for your reply.
I tried with the jar you pointed to, but it complains about the missing
HttpPatch class, which appears in httpclient 4.2:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/http/client/methods/HttpPatch at
An RDD is immutable, so you cannot modify it.
If you want to modify some value or the schema of an RDD, use map to generate a
new RDD.
The following code for your reference:
def add(a:Int,b:Int):Int = {
a + b
}
val d1 = sc.parallelize(1 to 10).map { i => (i, i+1, i+2) }
val d2 = d1.map { i => (i._1,
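(The last line above is truncated; a hedged guess at how it might continue, reusing the add function defined above:)

// d1 is untouched; d2 is a new RDD derived from it.
val d2 = d1.map { i => (i._1, add(i._2, i._3)) }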
Hi,
I am trying to leftJoin another vertex RDD (e.g. vB) with this one (vA).
vA.leftJoin(vB)(f)
- vA is the vertices RDD in graph G, and G is edge-partitioned using
EdgePartition2D.
- vB is created using the default partitioner (actually I am not sure...)
So, I am wondering: if vB has the same
Hi,
I'm new to this list, so please excuse if I'm asking simple questions.
We are experimenting with Spark deployment on existing CDH clusters.
However, the Spark package that comes with CDH is very out of date (v1.0.0).
Has anyone had experience with a custom Spark upgrade for CDH 5? Any
No, CDH 5.2 includes Spark 1.1 actually, which is the latest released
minor version; 5.3 will include 1.2, which is not released yet.
You can make a build of just about any version of Spark for CDH 5 and
manually install it yourself, sure, but it would be easier to just update
CDH. The instructions are
Hi,
I'm a newbie with Spark; I'm just trying to use Spark Streaming and
filter some data sent over a Java socket, but it's not working... it
works when I use ncat.
Why is it not working?
My Spark code is just this:
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("Test")
val
When I try to load a text file from an HDFS path using
sc.wholeTextFiles("hdfs://localhost:54310/graphx/anywebsite.com/anywebsite.com/")
I get the following error:
java.io.FileNotFoundException: Path is not a file:
/graphx/anywebsite.com/anywebsite.com/css
(full stack trace at bottom of
Hi,
I’ve started my first experiments with Spark Streaming and started with setting
up an environment using ScalaTest to do unit testing. Poked around on this
mailing list and googled the topic.
One of the things I wanted to be able to do is to use Scala Sequences as data
source in the tests
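(Not from the original thread, but a minimal sketch of one way to feed Scala sequences into Spark Streaming in a test, using queueStream; names and the batch interval are arbitrary:)

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
val ssc  = new StreamingContext(conf, Seconds(1))

// Each RDD placed in the queue is consumed as one micro-batch.
val batches = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
val queue   = mutable.Queue(batches.map(ssc.sparkContext.parallelize(_)): _*)

ssc.queueStream(queue).map(_ * 2).print()
ssc.start()
Thread.sleep(5000)   // let a few batches run, then shut down
ssc.stop()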
I have created a ServerSocket program, which you can find here:
https://gist.github.com/akhld/4286df9ab0677a555087 It simply listens on the
given port, and when a client connects it sends the contents of the
given file. I'm attaching the executable jar also; you can run the jar as:
java
I'm not quite sure whether Spark will go inside subdirectories and pick up
files from them. You could do something like the following to bring all files
into one directory.
find . -iname '*' -exec mv '{}' . \;
Thanks
Best Regards
On Fri, Dec 12, 2014 at 6:34 PM, Karen Murphy k.l.mur...@qub.ac.uk
I don't understand what Spark Streaming's socketTextStream is waiting for...
Is it like a server, so you just have to send data from a client? Or
what is it expecting?
2014-12-12 14:19 GMT+01:00 Akhil Das ak...@sigmoidanalytics.com:
I have created a Serversocket program which you can find over here
socketTextStream is a socket client which will read from a TCP ServerSocket.
Thanks
Best Regards
On Fri, Dec 12, 2014 at 7:21 PM, Guillermo Ortiz konstt2...@gmail.com
wrote:
I dont' understand what spark streaming socketTextStream is waiting...
is it like a server so you just have to send data
Try this:
https://github.com/RetailRocket/SparkMultiTool
This loader solves slow reading of a big data set of small files in HDFS.
I had this problem also with Spark 1.1.1. At the time I was using Hadoop
0.20.
To get around it I installed Hadoop 2.5.2 and set protobuf.version to
2.5.0 in the build command, like so:
mvn -Phadoop-2.5 -Dhadoop.version=2.5.2 -Dprotobuf.version=2.5.0
-DskipTests clean package
So I
There is no hadoop-2.5 profile. You can use hadoop-2.4 for 2.4+. This
profile already sets protobuf.version to 2.5.0 for this reason. It is
already something you can set on the command line as it is read as a
Maven build property. It does not pick up an older version because
it's somewhere on your
On Fri, Dec 12, 2014 at 2:17 PM, Eric Loots eric.lo...@gmail.com wrote:
How can the log level in test mode be reduced (or extended when needed)?
Hello Eric,
The following might be helpful for reducing the log messages during unit
testing:
http://stackoverflow.com/a/2736/236007
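(One common approach, an assumption rather than the content of the linked answer: raise the log4j level for the chatty packages in your test setup.)

import org.apache.log4j.{Level, Logger}

// Silence Spark's and Akka's INFO chatter during unit tests.
Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
Logger.getLogger("akka").setLevel(Level.WARN)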
--
Emre
Hi,
I had time to try it again. I submitted my app with the same command plus
these additional options:
--jars lib/datanucleus-api-jdo-3.2.6.jar,lib/datanucleus-core-3.2.10.jar,
lib/datanucleus-rdbms-3.2.9.jar
Now the app successfully creates a hive context. So my question remains: Is
Hi Sowen,
Thanks for the tip. When will CDH 5.3 be released? Some sort of
timeline would be helpful.
Thanks,
Jing
On 12 Dec 2014, at 12:34, Sean Owen so...@cloudera.com wrote:
No, CDH 5.2 includes Spark 1.1 actually, which is the latest released
minor version; 5.3 will include 1.2, which
Hi,
I asked on SO and got an answer about this:
http://stackoverflow.com/questions/27444512/missing-classes-from-the-assembly-file-created-by-sbt-assembly
Adding fullClasspath in assembly := (fullClasspath in Compile).value
at the end of my build.sbt solved the problem, apparently.
Best,
Hi everyone,
I am trying to save a decision tree model in Python and I use
pickle.dump() to do this. However, it returns the following error:
cPickle.UnpickleableError: Cannot pickle type 'thread.lock' objects
I did some tests on the other models. It seems that decision tree
Looks like the way to go.
Quick question regarding the connection pool approach - if I have a connection
that gets lazily instantiated, will it automatically die if I kill the driver
application? In my scenario, I can keep a connection open for the duration of
the app, and aren't that
Thanks for the reply. I might be misunderstanding something basic. As far
as I can tell, the cluster manager (e.g. Mesos) manages the master and
worker nodes but not the drivers or receivers, those are external to the
spark cluster:
http://spark.apache.org/docs/latest/cluster-overview.html
IIUC, Receivers run on workers, colocated with other tasks.
The Driver, on the other hand, can either run on the querying machine (local
mode) or as a worker (cluster mode).
—
FG
On Fri, Dec 12, 2014 at 4:49 PM, twizansk twiza...@gmail.com wrote:
Thanks for the reply. I might be
(1) I understand about immutability; that's why I said I wanted a new
SchemaRDD.
(2) I specifically asked for a non-SQL solution that takes a SchemaRDD and
results in a new SchemaRDD with one new function.
(3) The DSL stuff is a big clue, but I can't find adequate documentation
for it.
What I'm
https://github.com/jayunit100/SparkStreamingCassandraDemo
On this note, I've built a framework which is mostly pure, so that functional
unit tests can be run composing mock data for Twitter statuses with just
regular JUnit... That might be relevant also.
I think at some point we should come
Hi,
Trying to use LBFGS as the optimizer, do I need to implement feature scaling
via StandardScaler, or does LBFGS do it by default?
The following code generated the error "Failure again! Giving up and returning.
Maybe the objective is just poorly behaved?":
val data =
I'm still wrapping my head around the fact that the data backing an RDD is
immutable, since an RDD may need to be reconstructed from its lineage at any
point. In the context of clustering there are many iterations where an RDD may
need to change (for instance cluster assignments, etc) based on
Hi to all,
Our problem was passing configuration from the Spark driver to the slaves.
After a lot of time spent figuring out how things work, this is the
solution I came up with.
Hope this will be helpful for others as well.
You can read about it in my blog post.
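(The blog link is missing from the archive; the sketch below is an assumption about one common technique, not necessarily the post's solution: put the settings in a broadcast variable so every task can read them on its executor.)

// Hypothetical application settings assembled on the driver.
val appConfig: Map[String, String] = Map("api.endpoint" -> "http://example.com")
val configBc = sc.broadcast(appConfig)

// Every task reads the broadcast value locally, no re-serialization per task.
val records = sc.parallelize(Seq("a", "b", "c"))
val tagged  = records.map(r => (configBc.value("api.endpoint"), r))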
Thanks Marcelo.
Spark gurus / Databricks team - do you have something on the roadmap for such
a Spark server?
Thanks,
On Thu, Dec 11, 2014 at 5:43 PM, Marcelo Vanzin van...@cloudera.com wrote:
Oops, sorry, fat fingers.
We've been playing with something like that inside Hive:
Hi guys,
I want to kill an application, but I could not find the driver ID of
the application in the web UI. Is there any way to get it from the command line?
Thanks
--
Sincerely Yours
Xingwei Yang
https://sites.google.com/site/xingweiyang1223/
You need to apply the StandardScaler yourself to help convergence.
LBFGS just takes whatever objective function you provide without doing
any scaling. I would like to provide LinearRegressionWithLBFGS, which
does the scaling internally, in the near future.
Sincerely,
DB Tsai
Thanks for the confirmation.
FYI, the code below works for a similar dataset, but with the feature magnitude
changed; LBFGS converged to the right weights.
For example, time-sequential feature values 1, 2, 3, 4, 5 would generate the
error, while sequential features 14111, 14112, 14113, 14115 would
Hi All,
I have Spark on YARN and there are multiple Spark jobs on the cluster.
Sometimes some jobs are not getting enough resources even when there are
enough free resources available on the cluster, even when I use the settings
below:
--num-workers 75 \
--worker-cores 16
Jobs stick with the resources
Hi!
tldr; We're looking at potentially using Spark+GraphX to compute PageRank
over a 4 billion node + 128 billion edge graph on a regular (monthly)
basis, possibly growing larger in size over time. If anyone has hints /
tips / upcoming optimizations I should test out (or wants to contribute --
I have an RDD which is potentially too large to store in memory with
collect. I want a single task to write the contents as a file to HDFS. Time
is not a large issue, but memory is.
I do the following, converting my RDD (scans) to a local iterator. This
works, but hasNext shows up as a separate task
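(A minimal sketch of the pattern described, with a hypothetical output path: stream the RDD through the driver one partition at a time instead of collect(), writing a single HDFS file.)

import java.io.PrintWriter
import org.apache.hadoop.fs.{FileSystem, Path}

val fs  = FileSystem.get(sc.hadoopConfiguration)
val out = new PrintWriter(fs.create(new Path("/user/demo/scans.txt")))
// toLocalIterator pulls one partition at a time, so only a single
// partition needs to fit in driver memory.
scans.toLocalIterator.foreach(record => out.println(record.toString))
out.close()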
It seems that your response is not scaled, which will cause issues in
LBFGS. Typically, people train linear regression with
zero-mean/unit-variance features and response, without training the
intercept. Since the response is zero-mean, the intercept will always be
zero. When you convert the
Hi Experts,
I have recently installed HDP 2.2 (which depends on Hadoop 2.6).
My Spark 1.2 is built with the hadoop-2.4 profile.
My program has the following dependencies:
val avro  = "org.apache.avro" % "avro-mapred" % "1.7.7"
val spark = "org.apache.spark" % "spark-core_2.10" % "1.2.0" %
"provided"
My
Hi Experts,
I have recently installed HDP 2.2 (which depends on Hadoop 2.6).
My Spark 1.2 is built with the hadoop-2.3 profile:
( mvn -Pyarn -Dhadoop.version=2.6.0 -Dyarn.version=2.6.0 -Phadoop-2.3
-Phive -DskipTests clean package )
My program has the following dependencies:
val avro =
Instead of doing this on the compute side, I would just write out the file
with different blocks initially into HDFS and then use hadoop fs
-getmerge or HDFSConcat to get one final output file.
- SF
On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis lordjoe2...@gmail.com wrote:
I have an RDD
Thanks for the info.
How do I use StandardScaler() to scale the example data (10246.0,[14111.0,1.0])?
Thx
tri
-Original Message-
From: dbt...@dbtsai.com [mailto:dbt...@dbtsai.com]
Sent: Friday, December 12, 2014 1:26 PM
To: Bui, Tri
Cc: user@spark.apache.org
Subject: Re: Do I need to
Hi,
FYI - There are no Worker JVMs used when Spark is launched under YARN.
Instead the NodeManager in YARN does what the Worker JVM does in Spark
Standalone mode.
For YARN you'll want to look into the following settings:
--num-executors: controls how many executors will be allocated
You can do something like the following.
val rddVector = input.map {
  case (response, vec) =>
    val newVec = MLUtils.appendBias(vec)
    newVec.toBreeze(newVec.size - 1) = response
    newVec
}
val scalerWithResponse = new StandardScaler(true, true).fit(rddVector)
val trainingData =
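(The last line is truncated; a hedged guess at how it might continue — transform the vectors, then split the scaled response back out into a LabeledPoint:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val trainingData = scalerWithResponse.transform(rddVector).map { v =>
  // Last element holds the (now scaled) response; the rest are features.
  LabeledPoint(v(v.size - 1), Vectors.dense(v.toArray.dropRight(1)))
}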
Hey Manoj,
One proposal potentially of interest is the Spark Kernel project from
IBM - you should look for their. The interface in that project is more
of a remote REPL interface, i.e. you submit commands (as strings)
and get back results (as strings), but you don't have direct
programmatic
but on Spark 0.9 we don't have these options:
--num-executors: controls how many executors will be allocated
--executor-memory: RAM for each executor
--executor-cores: CPU cores for each executor
On Fri, Dec 12, 2014 at 12:27 PM, Sameer Farooqui same...@databricks.com
wrote:
Hi,
FYI - There
We are happy to announce a developer preview of the Spark Kernel which
enables remote applications to dynamically interact with Spark. You can
think of the Spark Kernel as a remote Spark Shell that uses the IPython
notebook interface to provide a common entrypoint for any application. The
Spark
Haoming,
If the Spark UI states that one of the jobs is in the Waiting state, this
is a resource issue. You will need to set properties such as:
spark.executor.memory
spark.cores.max
Set these so that each instance only takes a portion of the available worker
memory and cores.
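(A minimal sketch with hypothetical values, capping one application's share so other jobs can still get workers:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")                    // hypothetical app name
  .set("spark.executor.memory", "2g")     // memory per executor
  .set("spark.cores.max", "4")            // total cores this app may take
val sc = new SparkContext(conf)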
Regards,
Mike
The objective is to let the Spark application generate a file in a format
which can be consumed by other programs - as I said, I am willing to give up
parallelism at this stage (all the expensive steps were earlier), but I do want
an efficient way to pass once through an RDD without the requirement to
What would good spill settings be?
On Fri, Dec 12, 2014 at 2:45 PM, Sameer Farooqui same...@databricks.com
wrote:
You could try re-partitioning or coalescing the RDD to a single partition and
then write it to disk. Make sure you have good spill settings enabled so that
the RDD can spill to the local
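(The message above is truncated; a minimal sketch of the suggestion, using the scans RDD from the earlier message and a hypothetical output path:)

// Collapse the RDD to a single partition, then write it out as one file.
scans.coalesce(1).saveAsTextFile("hdfs:///user/demo/scans-single")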
Hello,
Few questions on Spark SQL:
1) Does Spark SQL support an equivalent SQL query for the Scala command
isCached(tableName)?
2) Is there a documentation spec I can reference for questions like this?
The closest doc I can find is this one:
http://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
On Fri, Dec 12, 2014 at 3:14 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Hello,
Few questions on Spark SQL:
1) Does Spark SQL support equivalent SQL Query for Scala command:
Hi Sam,
We developed the Spark Kernel with a focus on the newest version of the
IPython message protocol (5.0) for the upcoming IPython 3.0 release.
We are building around Apache Spark's REPL, which is used in the current
Spark Shell implementation.
The Spark Kernel was designed to be
I'm looking at the source code of SVM.scala and trying to find the location
of the source code of the following function:
def train(...): SVMModel = { new SVMWithSGD( ... ).run(input,
initialWeights) }
I'm wondering where I can find the code for SVMWithSGD().run()?
I'd like to see the
class SVMWithSGD is defined in the same file you're already looking
at. It inherits the run() method from its superclass,
GeneralizedLinearAlgorithm. An IDE would help you trace this right
away.
On Sat, Dec 13, 2014 at 12:52 AM, Caron caron.big...@gmail.com wrote:
I'm looking at the source code
What is the proper way to build with Hive from sbt? SPARK_HIVE is
deprecated. However, after running the following:
sbt -Pyarn -Phadoop-2.3 -Phive assembly/assembly
and then:
bin/pyspark
hivectx = HiveContext(sc)
hivectx.hiveql("select * from my_table")
Exception: (You must build
I am using updateStateByKey to maintain state in my streaming
application; the state gets accumulated over time.
Is there a way I can delete the old state data or put a limit on the amount
of state the state DStream can keep in the system?
Thanks,
Sunil.
I am getting the same message when trying to get a HiveContext in CDH 5.1
after enabling Spark. I think Spark should come with Hive enabled (as the
default option), since the Hive metastore is a common way to share data, due
to the popularity of Hive and other SQL-over-Hadoop technologies like Impala.
Thanks,
If you no longer need to maintain state for a key, just return None for that
value and it gets removed.
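(A minimal sketch of that idea; the one-hour timeout and pairStream are hypothetical. Returning None for a key drops its state:)

// State per key: (running count, last update time in ms).
val updateFunc = (newValues: Seq[Long], state: Option[(Long, Long)]) => {
  val now = System.currentTimeMillis()
  val (count, lastSeen) = state.getOrElse((0L, now))
  if (newValues.isEmpty && now - lastSeen > 3600 * 1000L)
    None                                   // stale: remove this key's state
  else
    Some((count + newValues.sum, if (newValues.nonEmpty) now else lastSeen))
}
val stateStream = pairStream.updateStateByKey(updateFunc)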
From: Sunil Yarram yvsu...@gmail.com
Date: Friday, December 12, 2014 at 9:44 PM
To: user@spark.apache.org
Hi,
In addition to the options Sameer mentioned, we need to enable the
external shuffle manager, right?
Thanks,
- Tsuyoshi
On Sat, Dec 13, 2014 at 5:27 AM, Sameer Farooqui same...@databricks.com wrote:
Hi,
FYI - There are no Worker JVMs used when Spark is launched under YARN.
Instead the
Yes, socketTextStream starts a TCP client that tries to connect to a
TCP server (localhost: in your case). If there is a server running
on that port that can send data to connected TCP connections, then you
will receive data in the stream.
Did you check out the quick example in the streaming
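(A minimal sketch of the client side, assuming a local server on port 9999, e.g. started with nc -lk 9999:)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("SocketDemo")
val ssc  = new StreamingContext(conf, Seconds(1))

// socketTextStream connects out to the server; it does not listen itself.
val lines = ssc.socketTextStream("localhost", 9999)
lines.filter(_.contains("ERROR")).print()

ssc.start()
ssc.awaitTermination()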
There isn't a SQL statement that directly maps to SQLContext.isCached,
but you can use EXPLAIN EXTENDED to check whether the underlying
physical plan is an InMemoryColumnarTableScan.
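(A minimal sketch with a hypothetical registered table name: cache the table, then look for InMemoryColumnarTableScan in the extended plan output.)

sqlContext.cacheTable("events")                          // hypothetical table
sqlContext.sql("EXPLAIN EXTENDED SELECT * FROM events")
  .collect()
  .foreach(println)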
On 12/13/14 7:14 AM, Judy Nash wrote:
Hello,
Few questions on Spark SQL:
1)Does Spark SQL support