Hi All,
I am working with Spark to add new slaves automatically when there is more data
to be processed by the cluster. During this process a question has arisen:
after adding/removing a slave node to/from the Spark cluster, do we need to
restart the master and the other existing slaves in the
Hi,
I am working on auto-scaling a Spark cluster. I would like to know in detail
how the master distributes data to the slaves for processing.
Any information on this would be helpful.
Thanks,
Srujana
Hi,
I am just curious to know what the differences are between the prebuilt
packages for Hadoop 1, Hadoop 2, CDH, etc.
I am using a Spark standalone cluster and we don't use Hadoop at all.
Can we use any one of the pre-built packages, or do we have to run the
make-distribution.sh script from the source code?
Thanks,
--
Hi all,
So far, I run my Spark jobs with the spark-shell or spark-submit command. I'd
like to go further, and I wonder how to use Spark as the backend of a web
application. Specifically, I want a frontend application (built with Node.js)
to communicate with Spark on the backend, so that every query
Hi,
I am trying to predict an attribute with a binary value (Yes/No) using SVM.
All the attributes that belong to the training set are text attributes.
I understand that I have to convert my outcome to a double (0.0/1.0), but I
do not understand how to deal with my explanatory variables, which are also
Hi,
You need to convert your text to a vector space model:
http://en.wikipedia.org/wiki/Vector_space_model
and then pass it to SVM. As far as I know, in previous versions of MLlib there
was a special class for doing this:
Hi Alexander,
Thanks for your prompt response. Earlier I was doing this prediction
using Weka only, but now we are moving to a huge dataset and hence to Apache
Spark MLlib. Is there any other way to convert to libSVM format? Or is there
any simpler algorithm that I can use in MLlib?
Hi Imk,
There are a number of libraries and scripts to convert text to libsvm format, if
you just type "libsvm format converter" into a search engine. Unfortunately I
cannot recommend a specific one, except the one that is built into Weka. I use it
for test purposes, and for big experiments it is
On Tue, Jun 24, 2014 at 12:28 PM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
You need to convert your text to vector space model:
http://en.wikipedia.org/wiki/Vector_space_model
and then pass it to SVM. As far as I know, in previous versions of MLlib
there was a special class for doing
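For what it's worth, a rough sketch of doing the conversion by hand in MLlib
(the vocabulary construction is deliberately simplistic, the docs RDD is a
made-up placeholder of (label, tokenized text) pairs, and labels must already
be 0.0/1.0):

  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // docs: RDD[(Double, Seq[String])]
  val vocab = docs.flatMap(_._2).distinct().collect().zipWithIndex.toMap
  val bcVocab = sc.broadcast(vocab)

  val training = docs.map { case (label, tokens) =>
    val counts = tokens.groupBy(identity).map { case (term, occurrences) =>
      (bcVocab.value(term), occurrences.size.toDouble)   // term frequency
    }.toSeq
    LabeledPoint(label, Vectors.sparse(bcVocab.value.size, counts))
  }.cache()

  val model = SVMWithSGD.train(training, 100)   // 100 iterations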
Hi again,
I've finally solved the problem below; it was due to an old 1.0.0-rc3 Spark jar
lying around in my .m2 directory, which was used when I compiled my Spark
applications (with Maven).
Christophe.
On 20/06/2014 18:13, Christophe Préaud wrote:
Hi,
Since I migrated to spark 1.0.0, a
Does anyone know anything about it? Or should I actually move this topic to a
MLlib-specific mailing list? Any information is appreciated! Thanks!
I'm interested in this topic too :)
Are the MLlib core devs on this list?
E/
2014-06-24 14:19 GMT+02:00 holdingonrobin robinholdin...@gmail.com:
Anyone knows anything about it? Or should I actually move this topic to a
MLlib specif mailing list? Any information is appreciated! Thanks!
Run your Spark app in client mode together with a Spray REST service that
the front end can talk to.
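A minimal sketch of that setup (object name, port, and data path are made up;
this assumes spray-can/spray-routing on the classpath):

  import akka.actor.ActorSystem
  import org.apache.spark.{SparkConf, SparkContext}
  import spray.routing.SimpleRoutingApp

  object SparkRestBackend extends App with SimpleRoutingApp {
    implicit val system = ActorSystem("spark-rest-backend")
    // one long-lived driver; every HTTP request runs a job on the same context
    val sc = new SparkContext(new SparkConf().setAppName("rest-backend"))

    startServer(interface = "0.0.0.0", port = 8080) {
      path("lineCount") {
        get {
          complete {
            sc.textFile("hdfs:///data/input.txt").count().toString
          }
        }
      }
    }
  }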
On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com
wrote:
Hi all,
So far, I run my spark jobs with spark-shell or spark-submit command. I'd
like to go further and I wonder
I have a use case where I cannot figure out the Spark Streaming way to do
it.
Given two Kafka topics corresponding to two different types of events, A and
B, each element from topic A corresponds to an element from topic B.
Unfortunately, the elements can arrive hours apart.
The aggregation
There is a method in org.apache.spark.mllib.util.MLUtils called kFold
which will automatically partition your dataset for you into k train/test
splits, at which point you can build k different models and aggregate the
results.
For example (a very rough sketch - assuming I want to do 10-fold cross
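A rough sketch of what that loop could look like (the classifier choice, the
seed, and the data RDD[LabeledPoint] are placeholders):

  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.util.MLUtils

  val folds = MLUtils.kFold(data, numFolds = 10, seed = 42)

  val errors = folds.map { case (training, validation) =>
    val model = SVMWithSGD.train(training.cache(), 100)
    val wrong = validation.filter(p => model.predict(p.features) != p.label).count()
    wrong.toDouble / validation.count()
  }

  println("average validation error: " + errors.sum / errors.length)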
Hi Randy and Gino,
The issue is that standalone-cluster mode is not officially supported.
Please use standalone-client mode instead, i.e. specify --deploy-mode
client in spark-submit, or simply leave out this config because it defaults
to client mode.
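For example (master URL, class, and jar name below are placeholders):

  spark-submit --master spark://master:7077 --deploy-mode client \
    --class com.example.MyApp my-app.jar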
Unfortunately, this is not currently
We need a centralized spark logging solution. Ideally, it should:
* Allow any Spark process to log at multiple levels (info, warn,
debug) using a single line, similar to log4j
* All logs should go to a central location - so, to read the logs, we
don't need to check each worker by itself
*
Thanks Evan! I think it works!
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8188.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
With the following pseudo-code,
val rdd1 = sc.sequenceFile(...) // has 100 partitions
val rdd2 = rdd1.coalesce(100)
val rdd3 = rdd2 map { ... }
val rdd4 = rdd3.coalesce(2)
val rdd5 = rdd4.saveAsTextFile(...) // want only two output files
I would expect the parallelism of the map() operation to
Hello Tilak,
1. I get a 'Not found: type RDD' error. Can someone please tell me which jars
I need to add as external jars and what I should add under the import
statements so that this error will go away.
Do you not see any issues with the import statements?
Add the
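For reference, the RDD type lives in org.apache.spark.rdd, so the usual fix is
to put the Spark assembly (or spark-core) jar on the classpath and add:

  import org.apache.spark.rdd.RDD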
Hi all,
I'm trying to use JavaRDD.mapToPair(), but it fails with an NPE on the
executor. The PairFunction used in the call is null for some reason. Any
comments/help would be appreciated!
My setup is,
* Java 7
* Spark 1.0.0
* Hadoop 2.0.0-mr1-cdh4.6.0
Here's the code snippet.
import
Are you trying to process data as part of the same job (i.e., the same Spark
context)? Then all you have to do is cache the output RDD of your
processing. It'll run your processing once and cache the results for future
tasks, unless the node caching the RDD goes down.
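A rough sketch (input and expensiveTransform are made-up placeholders):

  import org.apache.spark.storage.StorageLevel

  val processed = input.map(expensiveTransform).persist(StorageLevel.MEMORY_AND_DISK)
  processed.count()   // first action runs the processing and caches the result
  processed.first()   // later actions in the same SparkContext reuse the cached data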
If you are trying to retain it for
I have seen this when I prevent shuffle data from spilling to disk. Can you
change the shuffle memory fraction? Is your data spilling to disk?
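For example, something like this when building the SparkConf (values are
illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.shuffle.spill", "true")             // let shuffle data spill to disk
    .set("spark.shuffle.memoryFraction", "0.4")     // give shuffles a larger share of memory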
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 23, 2014 at 12:09 PM,
Hi Sebastien,
Are you using PySpark by any chance? Is it working for you (post the
patch)?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 23, 2014 at 1:51 PM, Fedechicco fedechi...@gmail.com wrote:
I'm
Did you try to register the class with the Kryo serializer?
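Roughly like this in Spark 1.0 (com.example.MyClass is just a placeholder for
whatever class fails to serialize):

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class MyRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo) {
      kryo.register(classOf[com.example.MyClass])   // the class that fails to serialize
    }
  }

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "MyRegistrator")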
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Mon, Jun 23, 2014 at 7:00 PM, rrussell25 rrussel...@gmail.com wrote:
Thanks for pointer...tried Kryo and ran into a
I'm trying to link a spark slave with an already-setup master, using:
$SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077
However, the result shows that it cannot open a log file it is supposed to
create:
failed to launch org.apache.spark.deploy.worker.Worker:
tail: cannot open
I haven't set up passwordless login from the slave to the master node yet (I was
under the impression that this is not necessary since they communicate using
port 7077).
How about this:
map it to (key, value) pairs, then reduceByKey using the max operation.
Then on that RDD you can do a join with your lookup data and reduce (if you only
want to look up 2 values then you can use lookup directly as well).
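In code, roughly (with toy data):

  import org.apache.spark.SparkContext._

  val pairs  = sc.parallelize(Seq(("a", 3.0), ("b", 7.0), ("a", 5.0)))
  val lookup = sc.parallelize(Seq(("a", "Alpha"), ("b", "Beta")))

  val maxPerKey = pairs.reduceByKey((x, y) => math.max(x, y))
  val enriched  = maxPerKey.join(lookup)   // RDD[(String, (Double, String))]
  val justA     = maxPerKey.lookup("a")    // Seq(5.0) -- fine when you only need a couple of keys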
PS: these are the operations in Scala; I am not aware how far the PySpark
API
My app works fine under Spark 0.9. I just tried upgrading to Spark
1.0 by downloading the Spark distro to a dir, changing the sbt file,
and running sbt assembly, but now I get NoSuchMethodErrors when trying
to use spark-submit.
I copied in the SimpleApp example from
This would be really useful, especially for Shark, where a shift in
partitioning affects all subsequent queries unless the task scheduling time
beats spark.locality.wait. It can cause overall low performance for all
subsequent tasks.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Hi guys,
I have a small question. I want to create a Worker class which uses
ElasticClient to make queries to Elasticsearch (I want to enrich my data
with geo search results).
How can I do that? I tried to create a worker instance with ES host/port
parameters, but Spark throws an exception (my class
Not really. You are better off using a cluster manager like Mesos or Yarn
for this.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 24, 2014 at 11:35 AM, Sirisha Devineni
sirisha_devin...@persistent.co.in wrote:
The HDFS driver keeps changing and breaking compatibility, hence all the build
versions. If you don't use HDFS/YARN then you can safely ignore it.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Tue, Jun 24, 2014 at 12:16 PM,
So do you get 2171 as the output for that command? That command tells you
how many partitions your RDD has, so it’s good to first confirm that rdd1
has as many partitions as you think it has.
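For instance, using the rdd1/rdd4 names from the pseudo-code above:

  println(rdd1.partitions.size)   // expect 100
  println(rdd4.partitions.size)   // expect 2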
On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert alex.boisv...@gmail.com
wrote:
It's actually a set of
Yes, the subgraph operator takes a vertex predicate and keeps only the
edges where both vertices satisfy the predicate, so it will work as long as
you can express the sublist in terms of a vertex predicate.
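For example (the graph and the vertex ids to keep are placeholders):

  import org.apache.spark.graphx._

  val keep: Set[VertexId] = Set(1L, 2L, 3L)   // the sublist of vertices
  val sub  = graph.subgraph(vpred = (id, attr) => keep.contains(id))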
If that's not possible, you can still obtain the same effect, but you'll
have to use
Make sure all queries are called through class methods and wrap your query
info in a class having only simple properties (strings, collections, etc.).
If you can't find such a wrapper you can also use the SerializableWritable
wrapper out of the box, but it's not recommended (developer API, and it makes fat
To be clear, the number of map tasks is determined by the number of partitions
in the RDD, hence the suggestion by Nicholas.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas
OK, but in this case where can I store the ES connection? Or does every document
create a new ES connection inside the worker?
--
Skype: boci13, Hangout: boci.b...@gmail.com
On
Most likely the ES client is not serializable for you. You can try 3 workarounds:
1. Switch to Kryo serialization and register the client in Kryo; this might solve
your serialization issue.
2. Use mapPartitions for all your data and initialize your client in the
mapPartitions code (see the sketch below); this will create a client for each
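A sketch of workaround 2 (the records RDD and the GeoSearchClient class with its
lookup method are hypothetical stand-ins for your own client):

  val enriched = records.mapPartitions { iter =>
    val client = new GeoSearchClient("es-host", 9300)   // created once per partition, not shipped from the driver
    iter.map(record => (record, client.lookup(record)))
  }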
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh'
that allows you to propagate your config files to a number of master and
slave nodes. However, I haven't used it myself.
I got a 'NoSuchFieldError', which is of the same type. It's definitely a
dependency jar conflict. The Spark driver will load its own jars, which in
recent versions pull in many dependencies that are 1-2 years old. And if your
newer-version dependency is in the same package it will be shaded (Java's
first come
I'm afraid persisting a connection across two tasks is a dangerous act, as they
can't be guaranteed to be executed on the same machine. Your ES server may
think it's a man-in-the-middle attack!
I think it's possible to invoke a static method that gives you a connection from
a local 'pool', so nothing
It's not used as the default serializer because of some compatibility issues and
the requirement to register the classes.
Which part are you getting as non-serializable? You need to serialize that
class if you are sending it to Spark workers inside a map, reduce,
mapPartitions, or any of the operations on
On 6/24/14, Peng Cheng pc...@uow.edu.au wrote:
I got 'NoSuchFieldError' which is of the same type. its definitely a
dependency jar conflict. spark driver will load jars of itself which in
recent version get many dependencies that are 1-2 years old. And if your
newer version dependency is in
Dear Aaron,
Thanks for your help. I am still facing a few problems.
I am using a 3rd-party library (jar file) under the hood when I call
jc_-score. Each call to jc_-score will generate an array of doubles. It is
basically the score of the current sentence against every sentence in the
destrdd generated
Hi there,
Here is how I specify it during compilation.
scalac -classpath
Andrew,
Thanks for your answer. It validates our findings. Unfortunately, client mode
assumes that I'm running on a privileged node. What I mean by privileged is a
node that has network access to all the workers and vice versa. This is a big
assumption to make and unreasonable in certain
Thanks for the response.
I would also like to know: what happens if a slave node is removed while it
is processing some data? Does the master send that data to other slave nodes
for re-processing or to resume processing? And does that happen with
the help of HDFS?
Thanks,
Srujana
Hi Mayur,
I primarily use Scala, but I tested with PySpark, and it's working fine too
post the patch.
Thanks,
- Sebastien
On Tue, Jun 24, 2014 at 6:08 PM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
Hi Sebastien,
Are you using Pyspark by any chance, is that working for you (post the
I'm deploying a cluster to Amazon EC2, trying to override its internal IP
addresses with public DNS names.
I start the cluster with the environment parameter SPARK_PUBLIC_DNS=[my EC2
public DNS].
But it doesn't change anything on the web UI; it still shows the internal IP
address:
Spark Master at
Hi!
According to
https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging,
changing the log level is just a matter of creating a log4j.properties file (which
is on the classpath of Spark) and changing the log level there for the root
logger. I did these steps on every node in the cluster
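For reference, a minimal log4j.properties along the lines of Spark's
conf/log4j.properties.template would be:

  log4j.rootCategory=WARN, console
  log4j.appender.console=org.apache.log4j.ConsoleAppender
  log4j.appender.console.layout=org.apache.log4j.PatternLayout
  log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n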
Michael,
apparently, the parameter auto.offset.reset has a different meaning
in Spark's Kafka implementation than what is described in the
documentation.
The Kafka docs at https://kafka.apache.org/documentation.html
specify the effect of auto.offset.reset as:
What to do when there is no initial
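For reference, the parameter is passed through the kafkaParams map of
KafkaUtils.createStream; the topic name, group id, ZooKeeper address, and the
ssc StreamingContext below are made up:

  import kafka.serializer.StringDecoder
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.kafka.KafkaUtils

  val kafkaParams = Map(
    "zookeeper.connect" -> "zk1:2181",
    "group.id"          -> "my-consumer-group",
    "auto.offset.reset" -> "smallest")   // or "largest"

  val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Map("events" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)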