Using Spark as web app backend

2014-06-24 Thread Jaonary Rabarisoa
Hi all, So far, I run my Spark jobs with the spark-shell or spark-submit command. I'd like to go further, and I wonder how to use Spark as the backend of a web application. Specifically, I want a frontend application (built with Node.js) to communicate with Spark on the backend, so that every query from

Re: Using Spark as web app backend

2014-06-24 Thread Jörn Franke
Hi, You could use sock.js / websockets on the front end, so you can notify the user when the job is finished. You can regularly poll the URL of the job to check its status from your Node.js app - at the moment I do not know of an out-of-the-box solution. Nicer would be if your job sends a message v

Re: pyspark-Failed to run first

2014-06-24 Thread angel2014
It's ... kind of weird. If I try to execute this: cotizas = sc.textFile("A_ko"); print cotizas.take(10), it doesn't work, but if I remove only one "A" character from this file ... it's all OK ... At first I thought it was due to the number of splits or something like that ... but I downloaded

Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread lmk
Hi, I am trying to predict an attribute with a binary value (Yes/No) using SVM. All the attributes in my training set are text attributes. I understand that I have to convert my outcome to a double (0.0/1.0). But I do not understand how to deal with my explanatory variables, which are also

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Ulanov, Alexander
Hi, You need to convert your text to a vector space model: http://en.wikipedia.org/wiki/Vector_space_model and then pass it to SVM. As far as I know, in previous versions of MLlib there was a special class for doing this: https://github.com/amplab/MLI/blob/master/src/main/scala/feat/NGrams.scala.
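A minimal sketch of the kind of conversion Alexander describes, using a crude hashed term-frequency representation (the input path, feature dimension and tokenization are illustrative assumptions, not anything taken from MLlib or the thread):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Assumed input: tab-separated (label, text) lines, label already mapped to 0.0/1.0
    val labeledText = sc.textFile("data.tsv").map { line =>
      val Array(label, text) = line.split("\t", 2)
      (label.toDouble, text)
    }

    val numFeatures = 10000  // size of the hashed vector space (illustrative)

    val points = labeledText.map { case (label, text) =>
      // crude term-frequency vector via the hashing trick
      val tf = text.toLowerCase.split("\\s+")
        .map(term => (math.abs(term.hashCode) % numFeatures, 1.0))
        .groupBy(_._1)
        .map { case (index, hits) => (index, hits.map(_._2).sum) }
        .toSeq
      LabeledPoint(label, Vectors.sparse(numFeatures, tf))
    }

    val model = SVMWithSGD.train(points, 100)  // 100 iterations, just as an example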

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread lmk
Hi Alexander, Thanks for your prompt response. Earlier I was doing this prediction using Weka only. But now we are moving to a huge dataset and hence to Apache Spark MLlib. Is there any other way to convert to libSVM format? Or is there any other, simpler algorithm that I can use in MLlib? Than

RE: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Ulanov, Alexander
Hi lmk, There are a number of libraries and scripts to convert text to libsvm format; just type "libsvm format converter" into a search engine. Unfortunately I cannot recommend a specific one, except the one that is built into Weka. I use it for test purposes, and for big experiments it is eas

Re: Prediction using Classification with text attributes in Apache Spark MLLib

2014-06-24 Thread Sean Owen
On Tue, Jun 24, 2014 at 12:28 PM, Ulanov, Alexander wrote: > You need to convert your text to vector space model: > http://en.wikipedia.org/wiki/Vector_space_model > and then pass it to SVM. As far as I know, in previous versions of MLlib > there was a special class for doing this: > https://gi

Re: broadcast not working in yarn-cluster mode

2014-06-24 Thread Christophe Préaud
Hi again, I've finally solved the problem below, it was due to an old 1.0.0-rc3 spark jar lying around in my .m2 directory which was used when I compiled my spark applications (with maven). Christophe. On 20/06/2014 18:13, Christophe Préaud wrote: > Hi, > > Since I migrated to spark 1.0.0, a

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread holdingonrobin
Does anyone know anything about it? Or should I actually move this topic to an MLlib-specific mailing list? Any information is appreciated! Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8172.html Sent from

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Eustache DIEMERT
I'm interested in this topic too :) Are the MLlib core devs on this list? E/ 2014-06-24 14:19 GMT+02:00 holdingonrobin : > Anyone knows anything about it? Or should I actually move this topic to a > MLlib specif mailing list? Any information is appreciated! Thanks! > > > > -- > View this mess

Re: Using Spark as web app backend

2014-06-24 Thread Koert Kuipers
Run your Spark app in client mode together with a Spray REST service that the front end can talk to. On Tue, Jun 24, 2014 at 3:12 AM, Jaonary Rabarisoa wrote: > Hi all, > > So far, I run my spark jobs with spark-shell or spark-submit command. I'd > like to go further and I wonder how to use spa

Streaming aggregation

2014-06-24 Thread john levingston
I have a use case where I cannot figure out the Spark Streaming way to do it. Given two Kafka topics corresponding to two different types of events, A and B, each element from topic A corresponds to an element from topic B. Unfortunately, elements can arrive hours apart. The aggregation

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread Evan R. Sparks
There is a method in org.apache.spark.mllib.util.MLUtils called "kFold" which will automatically partition your dataset for you into k train/test splits at which point you can build k different models and aggregate the results. For example (a very rough sketch - assuming I want to do 10-fold cross
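For reference, a rough sketch of what Evan describes (the dataset path, model choice and accuracy metric are just placeholders):

    import org.apache.spark.mllib.util.MLUtils
    import org.apache.spark.mllib.classification.SVMWithSGD

    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")

    // kFold returns an Array of (training, validation) RDD pairs
    val folds = MLUtils.kFold(data, 10, 42)   // 10 folds, seed 42

    val accuracies = folds.map { case (training, validation) =>
      val model = SVMWithSGD.train(training, 100)
      val correct = validation.filter(p => model.predict(p.features) == p.label).count()
      correct.toDouble / validation.count()
    }
    println("mean accuracy: " + accuracies.sum / accuracies.length)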

Integrate Spark Editor with Hue for source compiled installation of spark/spark-jobServer

2014-06-24 Thread Sunita Arvind
Hello Experts, I am attempting to integrate the Spark Editor with Hue on CDH5.0.1. I have the Spark installation built manually from the sources for Spark 1.0.0. I am able to integrate this with Cloudera Manager. Background: --- We have a 3-node VM cluster with CDH5.0.1. We required spa

Setting user permissions for Spark and Shark

2014-06-24 Thread ajatix
Hi I am currently running a private mesos cluster of 1+3 machines for running Spark and Shark applications on it. I've currently installed everything from an admin account. I now want to run them from another account restricting access to the configuration settings. Any suggestions on how to go ab

Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Andrew Or
Hi Randy and Gino, The issue is that standalone-cluster mode is not officially supported. Please use standalone-client mode instead, i.e. specify --deploy-mode client in spark-submit, or simply leave out this config because it defaults to client mode. Unfortunately, this is not currently document
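In spark-submit terms, that looks something like this (the master URL, class and jar names are placeholders):

    spark-submit \
      --master spark://your-master:7077 \
      --deploy-mode client \
      --class com.example.YourApp \
      your-app.jar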

Centralized Spark Logging solution

2014-06-24 Thread Robert James
We need a centralized spark logging solution. Ideally, it should: * Allow any Spark process to log at multiple levels (info, warn, debug) using a single line, similar to log4j * All logs should go to a central location - so, to read the logs, we don't need to check each worker by itself * Ideally

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-24 Thread Andrew Or
Hi all, The short answer is that standalone-cluster mode through spark-submit is broken (and in fact not officially supported). Please use standalone-client mode instead. The long answer is provided here: http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3cCAMJOb8m6gF9B3W=p12hi88mex

LinearRegression giving different weights in Spark 1.0 and Spark 0.9

2014-06-24 Thread fintis
Hi, Maybe I am doing something wrong, but I suspect linear regression behaves differently in Spark 1.0 compared to Spark 0.9. I have the following data points: 23 9515 7 2.58 113 0.77 0.964 9.5 9 22 9830 8 1.15 126 0.38 0.964 9.5 9 14 10130 9 0.81 129 0.74 0.827 9.6 9 10 10250 11 0.

Re: How to use K-fold validation in spark-1.0?

2014-06-24 Thread holdingonrobin
Thanks Evan! I think it works! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-K-fold-validation-in-spark-1-0-tp8142p8188.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
With the following pseudo-code, val rdd1 = sc.sequenceFile(...) // has > 100 partitions val rdd2 = rdd1.coalesce(100) val rdd3 = rdd2 map { ... } val rdd4 = rdd3.coalesce(2) val rdd5 = rdd4.saveAsTextFile(...) // want only two output files I would expect the parallelism of the map() operation to

RE: Basic Scala and Spark questions

2014-06-24 Thread Muttineni, Vinay
Hello Tilak, 1. "I get a 'Not found: type RDD' error. Can someone please tell me which jars I need to add as external jars and what I should add under the import statements so that this error will go away." Do you not see any issues with the import statements? Add the spark-assembly-1.0.0-hadoop2
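For the "Not found: type RDD" part specifically, the usual missing imports are the following (a guess at the minimal set, assuming Spark 1.0):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD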

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
What do you get for rdd1._jrdd.splits().size()? You might think you’re getting > 100 partitions, but it may not be happening. On Tue, Jun 24, 2014 at 3:50 PM, Alex Boisvert wrote: > With the following pseudo-code, > > val rdd1 = sc.sequenceFile(...) // has > 100 partitions > val rdd2 = rdd1.c

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
It's actually a set of 2171 S3 files, with an average size of about 18MB. On Tue, Jun 24, 2014 at 1:13 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > What do you get for rdd1._jrdd.splits().size()? You might think you’re > getting > 100 partitions, but it may not be happening. > ​ >

Spark switch to debug loglevel

2014-06-24 Thread Philip Limbeck
Hi! According to https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging, changing the log level is just a matter of creating a log4j.properties (which is in the classpath of Spark) and changing the log level there for the root logger. I did these steps on every node in the cluster (mast
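For reference, a minimal conf/log4j.properties along those lines, adapted from the log4j.properties.template shipped with Spark (the DEBUG level on the root logger is the only intended change):

    # Log everything to the console at DEBUG level
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Optionally quiet down particularly chatty components
    log4j.logger.org.eclipse.jetty=WARN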

Graphx SubGraph

2014-06-24 Thread aymanshalaby
Hi guys, I am a newbie with Spark/GraphX. We are considering using GraphX in production. Our 1st use case is: given a sublist of vertices in the graph, we want to return the induced edges between the vertices of this sublist. Please correct me if I am wrong: is that what the subgraph functio

JavaRDD.mapToPair throws NPE

2014-06-24 Thread Mingyu Kim
Hi all, I'm trying to use JavaRDD.mapToPair(), but it fails with an NPE on the executor. The PairFunction used in the call is null for some reason. Any comments/help would be appreciated! My setup is: * Java 7 * Spark 1.0.0 * Hadoop 2.0.0-mr1-cdh4.6.0 Here's the code snippet. > import org.apache.sp

Re: Persistent Local Node variables

2014-06-24 Thread Mayur Rustagi
Are you trying to process data as part of the same job (i.e. the same SparkContext)? Then all you have to do is cache the output RDD of your processing. It'll run your processing once & cache the results for future tasks, unless the node caching the RDD goes down. If you are trying to retain it for qu
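That is, roughly (the input path and the map are stand-ins for the real processing):

    val raw = sc.textFile("hdfs:///input/events")
    val processed = raw.map(_.toUpperCase).cache()   // mark the result for caching
    processed.count()    // first action computes the RDD and populates the cache
    processed.take(10)   // later actions in the same SparkContext reuse the cached data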

Re: Kafka Streaming - Error Could not compute split

2014-06-24 Thread Mayur Rustagi
I have seen this when I prevent spilling of shuffle data to disk. Can you change the shuffle memory fraction? Is your data spilling to disk? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Jun 23, 2014 at 12:09 PM, Kanwa

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-24 Thread Mayur Rustagi
Hi Sebastien, Are you using Pyspark by any chance, is that working for you (post the patch?) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Jun 23, 2014 at 1:51 PM, Fedechicco wrote: > I'm getting the same behavio

Re: Serialization problem in Spark

2014-06-24 Thread Mayur Rustagi
did you try to register the class in Kryo serializer? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, Jun 23, 2014 at 7:00 PM, rrussell25 wrote: > Thanks for pointer...tried Kryo and ran into a strange error: > > o

Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I'm trying to link a Spark slave with an already-set-up master, using: $SPARK_HOME/sbin/start-slave.sh spark://ip-172-31-32-12:7077 However, the result shows that it cannot open a log file it is supposed to create: failed to launch org.apache.spark.deploy.worker.Worker: tail: cannot open '/opt/spa

Re: Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
I haven't set up passwordless login from the slave to the master node yet (I was under the impression that this is not necessary since they communicate using port 7077) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-informati

Re: Serialization problem in Spark

2014-06-24 Thread Peng Cheng
I encountered the same problem with hadoop.fs.Configuration (a very complex, unserializable class). Basically, if your closure contains any instance (not a constant object/singleton! those are in the jar, not the closure) that doesn't inherit Serializable, or whose properties don't inherit Serializable, you a
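A common fix is to build the non-serializable object inside the task instead of capturing it in the closure; a sketch, assuming a Hadoop Configuration is what the executors need (the rdd and the `process` helper are hypothetical):

    import org.apache.hadoop.conf.Configuration

    // Instead of: val conf = new Configuration(); rdd.map(r => process(r, conf))
    // which drags `conf` into the serialized closure, create it on the worker:
    val results = rdd.mapPartitions { records =>
      val conf = new Configuration()           // built per partition, never serialized
      records.map(record => process(record, conf))
    }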

Re: Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-24 Thread Mayur Rustagi
How about this: map it to a key/value pair, then reduceByKey using a max operation. Then on the RDD you can do a join with your lookup data & reduce (if you only want to look up 2 values then you can use lookup directly as well). PS: these are the operations in Scala; I am not aware how far the pyspark api i

Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Robert James
My app works fine under Spark 0.9. I just tried upgrading to Spark 1.0 by downloading the Spark distro to a dir, changing the sbt file, and running sbt assembly, but now I get NoSuchMethodErrors when trying to use spark-submit. I copied in the SimpleApp example from http://spark.apache.org/docs/
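One thing worth double-checking when jumping from 0.9 to 1.0 is that the sbt build pulls the 1.0.0 artifacts and marks Spark as provided, so the assembly doesn't bundle a second, conflicting copy. A sketch of the relevant build.sbt lines (the Scala version is an assumption; the Spark coordinates are the standard ones):

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"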

Re: balancing RDDs

2014-06-24 Thread Mayur Rustagi
This would be really useful, especially for Shark, where a shift of partitioning affects all subsequent queries unless the task scheduling time beats spark.locality.wait. It can cause overall low performance for all subsequent tasks. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur

ElasticSearch enrich

2014-06-24 Thread boci
Hi guys, I have a small question. I want to create a "Worker" class which uses ElasticClient to query elasticsearch (I want to enrich my data with geo search results). How can I do that? I tried to create a worker instance with ES host/port parameters, but Spark throws an exception (my class

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Mayur Rustagi
Not really. You are better off using a cluster manager like Mesos or Yarn for this. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Jun 24, 2014 at 11:35 AM, Sirisha Devineni < sirisha_devin...@persistent.co.in> wrot

Re: How data is distributed while processing in spark cluster?

2014-06-24 Thread Mayur Rustagi
Using HDFS locality. The workers pull the data from HDFS/queues etc., unless you use parallelize, in which case it's sent from the driver (typically on the master) to the worker nodes. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On T

Re: Questions regarding different spark pre-built packages

2014-06-24 Thread Mayur Rustagi
The HDFS driver keeps changing & breaking compatibility, hence all the build versions. If you don't use HDFS/YARN then you can safely ignore it. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Jun 24, 2014 at 12:16 PM, So

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
So do you get 2171 as the output for that command? That command tells you how many partitions your RDD has, so it’s good to first confirm that rdd1 has as many partitions as you think it has. On Tue, Jun 24, 2014 at 4:22 PM, Alex Boisvert wrote: > It's actually a set of 2171 S3 files, with an

Re: Graphx SubGraph

2014-06-24 Thread Ankur Dave
Yes, the subgraph operator takes a vertex predicate and keeps only the edges where both vertices satisfy the predicate, so it will work as long as you can express the sublist in terms of a vertex predicate. If that's not possible, you can still obtain the same effect, but you'll have to use lower-
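A rough sketch of the vertex-predicate form (the graph and the id sublist are placeholders):

    import org.apache.spark.graphx._

    val wanted: Set[VertexId] = Set(1L, 2L, 3L)   // the sublist of vertex ids

    // Keeps only vertices in `wanted`, plus the edges whose two endpoints both pass the predicate
    val induced = graph.subgraph(vpred = (id, attr) => wanted.contains(id))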

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
Make sure all queries are called through class methods and wrap your query info in a class having only simple properties (strings, collections etc). If you can't find such a wrapper you can also use the SerializableWritable wrapper out of the box, but it's not recommended (developer-api and make fat cl

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Mayur Rustagi
To be clear, the number of map tasks is determined by the number of partitions inside the RDD, hence the suggestion by Nicholas. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Jun 25, 2014 at 4:17 AM, Nicholas Chammas < nich

Re: ElasticSearch enrich

2014-06-24 Thread boci
Ok, but in this case where can I store the ES connection? Or does every document create a new ES connection inside the worker? -- Skype: boci13, Hangout: boci.b...@gmail.com On W

Re: ElasticSearch enrich

2014-06-24 Thread Mayur Rustagi
Most likely the ES client is not serializable for you. You can do 3 workarounds: 1. Switch to Kryo serialization and register the client in Kryo; this might solve your serialization issue. 2. Use mapPartitions for all your data & initialize your client in the mapPartitions code; this will create a client for each parti
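A sketch of workaround 2 (the EsClient type, its methods and the document fields are placeholders for whatever the real client library, elastic4s here, exposes):

    val enriched = documents.mapPartitions { docs =>
      val client = new EsClient(esHost, esPort)   // hypothetical client, created on the executor
      val out = docs.map(doc => (doc, client.findNearestCity(doc.lat, doc.lon))).toList
      client.close()        // safe because the partition was materialized above
      out.iterator
    }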

Re: How to Reload Spark Configuration Files

2014-06-24 Thread Peng Cheng
I've read somewhere that in 1.0 there is a bash tool called 'spark-config.sh' that allows you to propagate your config files to a number of master and slave nodes. However, I haven't used it myself -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-Reload-

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Peng Cheng
I got a 'NoSuchFieldError', which is of the same type. It's definitely a dependency jar conflict. The Spark driver will load its own jars, which in recent versions pull in many dependencies that are 1-2 years old. And if your newer-version dependency is in the same package it will be shadowed (Java's first come

Re: ElasticSearch enrich

2014-06-24 Thread boci
I'm using elastic4s inside my ESWorker class. ESWorker now only contains two fields, host: String and port: Int. Now, inside the "findNearestCity" method I create the ElasticClient (elastic4s) connection. What's wrong with my class? Do I need to serialize ElasticClient? mapPartitions sounds good but I still got N

Re: ElasticSearch enrich

2014-06-24 Thread Peng Cheng
I'm afraid persisting a connection across two tasks is a dangerous act, as they can't be guaranteed to be executed on the same machine. Your ES server may think it's a man-in-the-middle attack! I think it's possible to invoke a static method that gives you a connection from a local 'pool', so nothing will
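In Scala, that per-JVM 'pool' can be as simple as a lazy singleton, which is re-created on each executor instead of being serialized (EsClient and the lookup call again stand in for the real client API):

    object EsConnection {
      lazy val client = new EsClient("es-host", 9200)   // initialized at most once per JVM, on first use
    }

    val enriched = documents.map(doc => EsConnection.client.findNearestCity(doc))  // placeholder call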

Re: ElasticSearch enrich

2014-06-24 Thread Mayur Rustagi
It's not used as the default serializer due to some issues with compatibility & the requirement to register classes. Which part are you getting as non-serializable? You need to serialize that class if you are sending it to Spark workers inside a map, reduce, mapPartitions or any of the operations on RDD

Re: Upgrading to Spark 1.0.0 causes NoSuchMethodError

2014-06-24 Thread Robert James
On 6/24/14, Peng Cheng wrote: > I got 'NoSuchFieldError' which is of the same type. its definitely a > dependency jar conflict. spark driver will load jars of itself which in > recent version get many dependencies that are 1-2 years old. And if your > newer version dependency is in the same packag

Re: ElasticSearch enrich

2014-06-24 Thread Holden Karau
So I'm giving a talk at the Spark summit on using Spark & ElasticSearch, but for now if you want to see a simple demo which uses elasticsearch for geo input you can take a look at my quick & dirty implementation with TopTweetsInALocation ( https://github.com/holdenk/elasticsearchspark/blob/master/s

RE: DAGScheduler: Failed to run foreach

2014-06-24 Thread Sameer Tilak
Dear Aaron, Thanks for your help. I am still facing a few problems. I am using a 3rd-party library (jar file) under the hood when I call jc_->score. Each call to jc_->score will generate an array of doubles. It is basically the score of the current sentence with every sentence in the destrdd generated

Re: Spark slave fails to start with weird error information

2014-06-24 Thread Peng Cheng
Has anyone encountered this situation? Also, I'm very sure my slave and master are in the same security group, with port 7077 open -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-slave-fail-to-start-with-wierd-error-information-tp8203p8227.html Sent from

RE: Basic Scala and Spark questions

2014-06-24 Thread Sameer Tilak
Hi there, Here is how I specify it during compilation: scalac -classpath /apps/software/abc.jar:/apps/software/spark-1.0.0-bin-hadoop1/lib/datanucleus-api-jdo-3.2.1.jar:/apps/software/spark-1.0.0-bin-hadoop1/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:spark-assembly-1.0.0-hadoop1.0.4.jar/datanucle

Re: problem about cluster mode of spark 1.0.0

2014-06-24 Thread Gino Bustelo
Andrew, Thanks for your answer. It validates our finding. Unfortunately, client mode assumes that I'm running on a "privileged node". What I mean by privileged is a node that has network access to all the workers and vice versa. This is a big assumption to make and unreasonable in certain circumstanc

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
Yes. scala> rawLogs.partitions.size res1: Int = 2171 On Tue, Jun 24, 2014 at 4:00 PM, Mayur Rustagi wrote: > To be clear number of map tasks are determined by number of partitions > inside the rdd hence the suggestion by Nicholas. > > Mayur Rustagi > Ph: +1 (760) 203 3257 > http://www.sigmoid

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Alex Boisvert
For the skeptics :), here's a version you can easily reproduce at home: val rdd1 = sc.parallelize(1 to 1000, 100) // force with 100 partitions val rdd2 = rdd1.coalesce(100) val rdd3 = rdd2 map { _ + 1000 } val rdd4 = rdd3.coalesce(2) rdd4.collect() You can see that everything runs as only 2 tasks

Re: partitions, coalesce() and parallelism

2014-06-24 Thread Nicholas Chammas
Ah, here's a better hypothesis. Everything you are doing minus the save() is a transformation, not an action. Since nothing is actually triggered until the save(), Spark may be seeing that the lineage of operations ends with 2 partitions anyway and simplifies accordingly. Two suggestions you can t
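One standard way to get both behaviours at once, i.e. 100-way parallelism for the map but only two output files, is to make the final coalesce a shuffle (this is not from the thread, just the usual trick):

    val rdd1 = sc.parallelize(1 to 1000, 100)
    val rdd3 = rdd1.map(_ + 1000)                 // still runs as 100 tasks
    val rdd4 = rdd3.coalesce(2, shuffle = true)   // the shuffle keeps the map stage at 100 partitions
    rdd4.saveAsTextFile("out")                    // only two output files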

Re: How data is distributed while processing in spark cluster?

2014-06-24 Thread srujana
Thanks for the response. I would also like to know: what happens if a slave node is removed while it is processing some data? Does the master send that data to other slave nodes for re-processing or to resume processing? And does it happen with the help of HDFS? Thanks, Srujana -- View this message in con

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-24 Thread Sébastien Rainville
Hi Mayur, I use primarily Scala, but I tested with pyspark, and it's working fine too post the patch. Thanks, - Sebastien On Tue, Jun 24, 2014 at 6:08 PM, Mayur Rustagi wrote: > Hi Sebastien, > Are you using Pyspark by any chance, is that working for you (post the > patch?) > > Mayur Rustagi

Does the PUBLIC_DNS environment parameter really work?

2014-06-24 Thread Peng Cheng
I'm deploying a cluster to Amazon EC2, trying to override its internal IP addresses with the public DNS. I start a cluster with the environment parameter SPARK_PUBLIC_DNS=[my EC2 public DNS], but it doesn't change anything on the web UI; it still shows the internal IP address: Spark Master at spark://ip-172-3

Re: Does the PUBLIC_DNS environment parameter really work?

2014-06-24 Thread Andrew Or
Hi Peng, What you're looking for is SPARK_MASTER_IP, which defaults to the output of the command "hostname" (see sbin/start-master.sh). What SPARK_PUBLIC_DNS does is it changes what the Master or the Worker advertise to others. If this is set, the links on the Master and Worker web UI will use pu
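In conf/spark-env.sh terms, the distinction looks roughly like this (the hostname is a placeholder):

    # What the standalone Master binds to and what workers connect to
    export SPARK_MASTER_IP=ec2-54-0-0-1.compute-1.amazonaws.com

    # What the Master/Worker web UIs advertise in their links
    export SPARK_PUBLIC_DNS=ec2-54-0-0-1.compute-1.amazonaws.com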

Re: DAGScheduler: Failed to run foreach

2014-06-24 Thread Aaron Davidson
That IntRef problem is very strange, as it's not related to running a Spark job, but rather just to interpreting the code in the REPL. There are two possibilities I can think of: - Spark was compiled with a different version of Scala than the one you're running it on. Spark is compiled on Scala 2.10 from Spar

Changing log level of spark

2014-06-24 Thread Philip Limbeck
Hi! According to https://spark.apache.org/docs/0.9.0/configuration.html#configuring-logging, changing the log level is just a matter of creating a log4j.properties (which is in the classpath of Spark) and changing the log level there for the root logger. I did these steps on every node in the cluster (mast

Re: Kafka client - specify offsets?

2014-06-24 Thread Tobias Pfeiffer
Michael, apparently, the parameter "auto.offset.reset" has a different meaning in Spark's Kafka implementation than what is described in the documentation. The Kafka docs at specify the effect of "auto.offset.reset" as: > What to do when there is no i