How to Reload Spark Configuration Files

2014-06-24 Thread Sirisha Devineni
Hi All, I am working with Spark to add new slaves automatically when there is more data to be processed by the cluster. During this process a question has arisen: after adding/removing a slave node to/from the Spark cluster, do we need to restart the master and the other existing slaves in the

How is data distributed while processing in a Spark cluster?

2014-06-24 Thread srujana
Hi, I am working on auto-scaling a Spark cluster. I would like to know in detail how the master distributes data to the slaves for processing. Any information on this would be helpful. Thanks, Srujana

Questions regarding different spark pre-built packages

2014-06-24 Thread Sourav Chandra
Hi, I am just curious to know what the differences are between the pre-built packages for Hadoop 1, Hadoop 2, CDH, etc. I am using a Spark standalone cluster and we don't use Hadoop at all. Can we use any one of the pre-built packages, or do we have to run the make-distribution.sh script from the code? Thanks,

Using Spark as web app backend

2014-06-24 Thread Jaonary Rabarisoa
Hi all, So far, I run my Spark jobs with the spark-shell or spark-submit command. I'd like to go further and I wonder how to use Spark as the backend of a web application. Specifically, I want a frontend application (built with Node.js) to communicate with Spark on the backend, so that every query

Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread lmk
Hi, I am trying to predict an attribute with a binary value (Yes/No) using SVM. All my attributes in the training set are text attributes. I understand that I have to convert my outcome to a double (0.0/1.0). But I do not understand how to deal with my explanatory variables, which are also

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Ulanov, Alexander
Hi, You need to convert your text to a vector space model: http://en.wikipedia.org/wiki/Vector_space_model and then pass it to SVM. As far as I know, in previous versions of MLlib there was a special class for doing this:
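
A minimal sketch of the vector-space idea above, assuming an RDD of (label, text) pairs and a fixed feature count: the term-frequency hashing is hand-rolled here rather than taken from any built-in MLlib class, and the labels are assumed to already be 0.0/1.0 as SVMWithSGD requires.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.rdd.RDD

    // Hash each token into a fixed-size term-frequency vector ("hashing trick").
    def hashingTf(text: String, numFeatures: Int) = {
      val counts = scala.collection.mutable.Map[Int, Double]()
      text.toLowerCase.split("\\W+").filter(_.nonEmpty).foreach { token =>
        val idx = ((token.hashCode % numFeatures) + numFeatures) % numFeatures
        counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
      }
      Vectors.sparse(numFeatures, counts.toSeq)
    }

    // docs: (label, text) pairs where the label is already 0.0 or 1.0.
    def trainTextSvm(docs: RDD[(Double, String)]) = {
      val numFeatures = 10000
      val points = docs.map { case (label, text) =>
        LabeledPoint(label, hashingTf(text, numFeatures))
      }.cache()
      SVMWithSGD.train(points, 100)   // returns an SVMModel
    }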

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread lmk
Hi Alexander, Thanks for your prompt response. Earlier I was doing this prediction using Weka only. But now we are moving to a huge dataset and hence to Apache Spark MLlib. Is there any other way to convert to libSVM format? Or is there any simpler algorithm that I can use in MLlib?

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Ulanov, Alexander
Hi lmk, There are a number of libraries and scripts to convert text to libSVM format; just type 'libsvm format converter' into a search engine. Unfortunately I cannot recommend a specific one, except the one that is built into Weka. I use it for test purposes, and for big experiments it is

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Sean Owen
On Tue, Jun 24, 2014 at 12:28 PM, Ulanov, Alexander alexander.ula...@hp.com wrote: You need to convert your text to vector space model: http://en.wikipedia.org/wiki/Vector_space_model and then pass it to SVM. As far as I know, in previous versions of MLlib there was a special class for doing

Re: broadcast not working in yarn-cluster mode

2014-06-24 Thread Christophe Préaud
Hi again, I've finally solved the problem below: it was due to an old 1.0.0-rc3 Spark jar lying around in my .m2 directory, which was used when I compiled my Spark applications (with Maven). Christophe. On 20/06/2014 18:13, Christophe Préaud wrote: Hi, Since I migrated to Spark 1.0.0, a

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread holdingonrobin
Does anyone know anything about this? Or should I actually move this topic to an MLlib-specific mailing list? Any information is appreciated! Thanks!

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Eustache DIEMERT
I'm interested in this topic too :) Are the MLlib core devs on this list? E/ 2014-06-24 14:19 GMT+02:00 holdingonrobin robinholdin...@gmail.com: Does anyone know anything about this? Or should I actually move this topic to an MLlib-specific mailing list? Any information is appreciated! Thanks!

Re: Using Spark as web app backend

2014-06-24 Thread Koert Kuipers
Run your Spark app in client mode together with a spray REST service that the front end can talk to. On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa jaon...@gmail.com wrote: Hi all, So far, I run my Spark jobs with the spark-shell or spark-submit command. I'd like to go further and I wonder
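
A rough sketch of this pattern: a long-lived driver process (client mode) holds the SparkContext and exposes an HTTP endpoint the Node.js frontend can call. The reply suggests spray; the JDK HttpServer is used here only to keep the sketch self-contained, and the dataset path and port are made up.

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkWebBackend {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("web-backend"))
        val data = sc.textFile("hdfs:///some/dataset").cache()   // hypothetical path

        val server = HttpServer.create(new InetSocketAddress(8090), 0)
        server.createContext("/count", new HttpHandler {
          def handle(exchange: HttpExchange): Unit = {
            val body = data.count().toString.getBytes("UTF-8")   // one Spark job per request
            exchange.sendResponseHeaders(200, body.length)
            exchange.getResponseBody.write(body)
            exchange.close()
          }
        })
        server.start()   // the SparkContext stays alive between requests
      }
    }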

Streaming aggregation

2014-06-24 Thread john levingston
I have a use case where I cannot figure out the Spark Streaming way to do it. Given two Kafka topics corresponding to two different types of events, A and B, each element from topic A has a corresponding element in topic B. Unfortunately, the two elements can arrive hours apart. The aggregation

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Evan R. Sparks
There is a method in org.apache.spark.mllib.util.MLUtils called kFold which will automatically partition your dataset for you into k train/test splits at which point you can build k different models and aggregate the results. For example (a very rough sketch - assuming I want to do 10-fold cross
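
A rough sketch of what this describes, using MLUtils.kFold; the dataset, the model choice (SVM) and the misclassification-rate metric are placeholders, not the poster's actual code.

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.rdd.RDD

    def crossValidate(data: RDD[LabeledPoint]): Double = {
      val folds = MLUtils.kFold(data, 10, 42)   // Array of (training, validation) RDD pairs
      val errors = folds.map { case (train, test) =>
        val model = SVMWithSGD.train(train, 100)
        test.filter(p => model.predict(p.features) != p.label).count().toDouble / test.count()
      }
      errors.sum / errors.length   // mean misclassification rate across the 10 folds
    }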

Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Andrew Or
Hi Randy and Gino, The issue is that standalone-cluster mode is not officially supported. Please use standalone-client mode instead, i.e. specify --deploy-mode client in spark-submit, or simply leave out this config because it defaults to client mode. Unfortunately, this is not currently

Centralized Spark Logging solution

2014-06-24 Thread Robert James
We need a centralized Spark logging solution. Ideally, it should: * allow any Spark process to log at multiple levels (info, warn, debug) using a single line, similar to log4j; * send all logs to a central location, so that to read the logs we don't need to check each worker by itself; *

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread holdingonrobin
Thanks Evan! I think it works!

partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
With the following pseudo-code,

    val rdd1 = sc.sequenceFile(...)   // has 100 partitions
    val rdd2 = rdd1.coalesce(100)
    val rdd3 = rdd2.map { ... }
    val rdd4 = rdd3.coalesce(2)
    rdd4.saveAsTextFile(...)          // want only two output files

I would expect the parallelism of the map() operation to

RE: Basic Scala and Spark questions

2014-06-24 Thread Muttineni, Vinay
Hello Tilak, 1. I get the error "not found: type RDD". Can someone please tell me which jars I need to add as external jars and what I should add under the import statements so that this error will go away. Do you not see any issues with the import statements? Add the
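
For reference, the "not found: type RDD" error usually just means the type was never imported; a minimal set of imports that makes it resolve (the jar needed on the classpath is spark-core / the Spark assembly):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // implicit conversions, e.g. pair RDD functions
    import org.apache.spark.rdd.RDD

    def lineLengths(lines: RDD[String]): RDD[Int] = lines.map(_.length)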

JavaRDD.mapToPair throws NPE

2014-06-24 Thread Mingyu Kim
Hi all, I'm trying to use JavaRDD.mapToPair(), but it fails with NPE on the executor. The PairFunction used in the call is null for some reason. Any comments/help would be appreciated! My setup is: * Java 7 * Spark 1.0.0 * Hadoop 2.0.0-mr1-cdh4.6.0 Here's the code snippet. import

Re: Persistent Local Node variables

2014-06-24 Thread Mayur Rustagi
If you are trying to process data as part of the same job (i.e. within the same SparkContext), then all you have to do is cache the output RDD of your processing. It will run your processing once and cache the results for future tasks, unless the node caching the RDD goes down. If you are trying to retain it for
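
A minimal sketch of the caching suggestion, with a made-up input path and a stand-in transformation:

    import org.apache.spark.SparkContext
    import org.apache.spark.storage.StorageLevel

    def process(sc: SparkContext): Unit = {
      val processed = sc.textFile("hdfs:///input")        // hypothetical path
        .map(line => line.split(",").length)              // stand-in for the real processing
        .persist(StorageLevel.MEMORY_AND_DISK)            // or simply .cache()
      processed.count()    // first action computes and caches the RDD
      processed.take(10)   // later actions in the same SparkContext reuse the cache
    }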

Re: Kafka Streaming - Error Could not compute split

2014-06-24 Thread Mayur Rustagi
I have seen this when I prevent spilling of shuffle data to disk. Can you change the shuffle memory fraction? Is your data spilling to disk?
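
The settings being referred to, set on the SparkConf (the values here are only examples):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.shuffle.spill", "true")            // allow shuffle data to spill to disk
      .set("spark.shuffle.memoryFraction", "0.4")    // fraction of heap used for shuffle buffers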

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-24 Thread Mayur Rustagi
Hi Sebastien, Are you using PySpark by any chance, and is it working for you (post the patch)? On Mon, Jun 23, 2014 at 1:51 PM, Fedechicco fedechi...@gmail.com wrote: I'm

Re: Serialization problem in Spark

2014-06-24 Thread Mayur Rustagi
Did you try to register the class with the Kryo serializer? On Mon, Jun 23, 2014 at 7:00 PM, rrussell25 rrussel...@gmail.com wrote: Thanks for the pointer... tried Kryo and ran into a
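
A sketch of Kryo registration as it works in Spark 1.0: define a KryoRegistrator and point the configuration at it. The record class here is a stand-in for whatever class fails to serialize.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class MyRecord(id: Long, text: String)   // stand-in for the problematic class

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyRecord])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")   // fully-qualified name if in a package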

Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I'm trying to link a Spark slave with an already-set-up master, using: $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077 However, the result shows that it cannot open a log file it is supposed to create: failed to launch org.apache.spark.deploy.worker.Worker: tail: cannot open

Re: Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I haven't set up passwordless login from the slave to the master node yet (I was under the impression that this is not necessary since they communicate using port 7077).

Re: Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-24 Thread Mayur Rustagi
How about this: map it to a key/value pair, then reduceByKey using a max operation. Then you can join the RDD with your lookup data (if you only want to look up 2 values then you can use lookup directly as well). PS: these are the operations in Scala; I am not aware how far the pyspark api
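
A rough Scala sketch of that sequence (the key/value types are invented for illustration): key the records, take the per-key max, then join against the lookup data.

    import org.apache.spark.SparkContext._   // brings in the pair RDD functions
    import org.apache.spark.rdd.RDD

    def maxThenEnrich(records: RDD[(String, Double)],
                      lookup: RDD[(String, String)]): RDD[(String, (Double, String))] = {
      val maxPerKey = records.reduceByKey((a, b) => math.max(a, b))   // largest value per key
      // For just a couple of keys, maxPerKey.lookup("someKey") avoids the full join.
      maxPerKey.join(lookup)                                          // enrich with lookup data
    }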

Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Robert James
My app works fine under Spark 0.9. I just tried upgrading to Spark 1.0 by downloading the Spark distro to a dir, changing the sbt file, and running sbt assembly, but now I get NoSuchMethodErrors when trying to use spark-submit. I copied in the SimpleApp example from
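
One common cause of NoSuchMethodError after an upgrade is ending up with two Spark versions on the classpath; a typical guard when building with sbt assembly is to mark Spark itself as provided so the assembly jar does not bundle its own copy. A build.sbt sketch, with made-up project settings:

    name := "simple-app"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"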

Re: balancing RDDs

2014-06-24 Thread Mayur Rustagi
This would be really useful, especially for Shark, where a shift of partitioning affects all subsequent queries unless the task scheduling time beats spark.locality.wait. It can cause overall low performance for all subsequent tasks.

ElasticSearch enrich

2014-06-24 Thread boci
Hi guys, I have a small question. I want to create a Worker class which uses ElasticClient to query Elasticsearch (I want to enrich my data with geo search results). How can I do that? I try to create a worker instance with ES host/port parameters but Spark throws an exception (my class

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Mayur Rustagi
Not really. You are better off using a cluster manager like Mesos or YARN for this.

Re: Questions regarding different spark pre-built packages

2014-06-24 Thread Mayur Rustagi
The HDFS driver keeps changing and breaking compatibility, hence all the build versions. If you don't use HDFS/YARN then you can safely ignore it.

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
So do you get 2171 as the output for that command? That command tells you how many partitions your RDD has, so it's good to first confirm that rdd1 has as many partitions as you think it has. On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert alex.boisv...@gmail.com wrote: It's actually a set of

Re: Graphx SubGraph

2014-06-24 Thread Ankur Dave
Yes, the subgraph operator takes a vertex predicate and keeps only the edges where both vertices satisfy the predicate, so it will work as long as you can express the sublist in terms of a vertex predicate. If that's not possible, you can still obtain the same effect, but you'll have to use
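
A small sketch of the vertex-predicate form described here; the vertex attribute type and the set of wanted IDs are invented for illustration.

    import org.apache.spark.graphx.{Graph, VertexId}

    def keepSublist(graph: Graph[String, Int], wanted: Set[VertexId]): Graph[String, Int] =
      graph.subgraph(vpred = (id, attr) => wanted.contains(id))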

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
Make sure all queries are called through class methods and wrap your query info in a class having only simple properties (strings, collections, etc.). If you can't find such a wrapper you can also use the SerializableWritable wrapper out of the box, but it's not recommended (it's a developer API and makes fat

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Mayur Rustagi
To be clear, the number of map tasks is determined by the number of partitions inside the RDD, hence the suggestion by Nicholas.
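
A sketch of the usual way to keep the map at full parallelism while still writing two files: pass shuffle = true to coalesce (the path and record types here are made up).

    import org.apache.spark.SparkContext

    def writeTwoFiles(sc: SparkContext): Unit = {
      val rdd1 = sc.sequenceFile[String, String]("hdfs:///input")   // hypothetical input
      val mapped = rdd1.map { case (k, v) => k + "\t" + v }
      // With shuffle = true the map stage keeps rdd1's partition count;
      // only the final write runs as 2 tasks and produces 2 output files.
      mapped.coalesce(2, shuffle = true).saveAsTextFile("hdfs:///out")
    }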

Re: ElasticSearch enrich

2014-06-24 Thread boci
OK, but in this case where can I store the ES connection? Or does every document create a new ES connection inside the worker?

Re: ElasticSearch enrich

2014-06-24 Thread Mayur Rustagi
Most likely the ES client is not serializable for you. You can do 3 workarounds: 1. Switch to Kryo serialization and register the client in Kryo; it might solve your serialization issue. 2. Use mapPartitions for all your data and initialize your client in the mapPartitions code; this will create a client for each
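
A sketch of workaround 2. The EsClient trait and factory below are stand-ins for whatever Elasticsearch client is actually in use; the point is only that the client is constructed inside mapPartitions, on the worker, instead of being shipped from the driver.

    import org.apache.spark.rdd.RDD

    trait EsClient { def geoLookup(doc: String): String; def close(): Unit }   // hypothetical
    def newEsClient(host: String, port: Int): EsClient = ???                   // hypothetical factory

    def enrich(docs: RDD[String]): RDD[String] =
      docs.mapPartitions { iter =>
        val client = newEsClient("es-host", 9200)                      // one client per partition
        val enriched = iter.map(doc => client.geoLookup(doc)).toList   // force before closing
        client.close()
        enriched.iterator
      }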

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Peng Cheng
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh' that allows you to propagate your config files to a number of master and slave nodes. However, I haven't used it myself.

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Peng Cheng
I got a 'NoSuchFieldError', which is of the same type. It's definitely a dependency jar conflict. The Spark driver will load its own jars, which in recent versions pull in many dependencies that are 1-2 years old. And if your newer-version dependency is in the same package it will be shaded (Java's first come

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
I'm afraid persisting a connection across two tasks is dangerous, as they can't be guaranteed to be executed on the same machine. Your ES server may think it's a man-in-the-middle attack! I think it's possible to invoke a static method that gives you a connection from a local 'pool', so nothing

Re: ElasticSearch enrich

2014-06-24 Thread Mayur Rustagi
It's not used as the default serializer due to some compatibility issues and the requirement to register classes. Which part are you getting as non-serializable? You need to serialize that class if you are sending it to Spark workers inside a map, reduce, mapPartitions or any of the operations on

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Robert James
On 6/24/14, Peng Cheng pc...@uow.edu.au wrote: I got 'NoSuchFieldError' which is of the same type. its definitely a dependency jar conflict. spark driver will load jars of itself which in recent version get many dependencies that are 1-2 years old. And if your newer version dependency is in

RE: DAGScheduler: Failed to run foreach

2014-06-24 Thread Sameer Tilak
Dear Aaron, Thanks for your help. I am still facing a few problems. I am using a 3rd-party library (jar file) under the hood when I call jc_-score. Each call to jc_-score will generate an array of doubles. It is basically the score of the current sentence against every sentence in the destrdd generated

RE: Basic Scala and Spark questions

2014-06-24 Thread Sameer Tilak
Hi there, Here is how I specify it during compilation: scalac -classpath

Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Gino Bustelo
Andrew, Thanks for your answer. It validates our finding. Unfortunately, client mode assumes that I'm running on a privileged node. What I mean by privileged is a node that has network access to all the workers and vice versa. This is a big assumption to make and unreasonable in certain

Re: How is data distributed while processing in a Spark cluster?

2014-06-24 Thread srujana
Thanks for the response. I would also like to know: what happens if a slave node is removed while it is processing some data? Does the master send that data to other slave nodes for re-processing/resuming the process? And does this happen with the help of HDFS? Thanks, Srujana

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-24 Thread Sébastien Rainville
Hi Mayur, I primarily use Scala, but I tested with PySpark, and it's working fine too post the patch. Thanks, - Sebastien On Tue, Jun 24, 2014 at 6:08 PM, Mayur Rustagi mayur.rust...@gmail.com wrote: Hi Sebastien, Are you using Pyspark by any chance, is that working for you (post the

Does the PUBLIC_DNS environment parameter really work?

2014-06-24 Thread Peng Cheng
I'm deploying a cluster to Amazon EC2, trying to override its internal IP addresses with the public DNS. I start a cluster with the environment parameter SPARK_PUBLIC_DNS=[my EC2 public DNS], but it doesn't change anything on the web UI; it still shows the internal IP address: Spark Master at

Changing log level of spark

2014-06-24 Thread Philip Limbeck
Hi! According to https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging, changing the log level is just a matter of creating a log4j.properties (which is in the classpath of Spark) and changing the log level there for the root logger. I did these steps on every node in the cluster
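
For completeness, the level can also be raised programmatically from the application; this only affects the JVM it runs in (the driver), which is why the log4j.properties route is needed for the workers.

    import org.apache.log4j.{Level, Logger}

    Logger.getRootLogger.setLevel(Level.WARN)
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)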

Re: Kafka client - specify offsets?

2014-06-24 Thread Tobias Pfeiffer
Michael, apparently, the parameter auto.offset.reset has a different meaning in Spark's Kafka implementation than what is described in the documentation. The Kafka docs at https://kafka.apache.org/documentation.html specify the effect of auto.offset.reset as: What to do when there is no initial
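
For reference, auto.offset.reset is passed through the kafkaParams map of the Spark 1.0 Kafka receiver; the group, ZooKeeper address and topic below are placeholders, and as noted above the setting may not behave exactly as the plain Kafka consumer documentation describes.

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    def createKafkaStream(ssc: StreamingContext) = {
      val kafkaParams = Map(
        "zookeeper.connect" -> "zk-host:2181",
        "group.id" -> "my-consumer-group",
        "auto.offset.reset" -> "smallest")   // start from the earliest offset Kafka still has
      val topics = Map("events" -> 1)        // topic -> number of receiver threads
      KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER)
    }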