Re: Spark serialization issues with third-party libraries

2014-11-24 Thread Arush Kharbanda
Hi, you can see my code here; it's a POC to implement UIMA on Spark: https://bitbucket.org/SigmoidDev/uimaspark UIMAProcessor.scala (https://bitbucket.org/SigmoidDev/uimaspark/src/8476fdf16d84d0f517cce45a8bc1bd3410927464/UIMASpark/src/main/scala/UIMAProcessor.scala?at=master) is the class where the major

Issues about running on client in standalone mode

2014-11-24 Thread LinQili
Hi all: I deployed a Spark client on my own machine. I put Spark at `/home/somebody/spark`, and the cluster workers' Spark home path is `/home/spark/spark`. When I launched the jar, it showed: `AppClient$ClientActor: Executor updated: app-20141124170955-11088/12 is now FAILED

Fwd: 1gb file processing...task doesn't launch on all the node...Unseen exception

2014-11-24 Thread Priya Ch
Hi, I tried with try/catch blocks. In fact, inside mapPartitionsWithIndex, a method is invoked which does the operation. I put the operations inside the function in a try...catch block, but that's of no use: the error still persists. I even commented out all the operations and a simple print statement

Re: Issues about running on client in standalone mode

2014-11-24 Thread Akhil Das
How are you submitting the job? Thanks Best Regards On Mon, Nov 24, 2014 at 3:02 PM, LinQili lin_q...@outlook.com wrote: Hi all: I deployed a Spark client on my own machine. I put Spark at `/home/somebody/spark`, and the cluster workers' Spark home path is `/home/spark/spark`. While

re: How to incrementally compile spark examples using mvn

2014-11-24 Thread Yiming (John) Zhang
Thank you, Marcelo and Sean, mvn install is a good answer for my needs. -Original Message- From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: November 21, 2014 1:47 To: yiming zhang Cc: Sean Owen; user@spark.apache.org Subject: Re: How to incrementally compile spark examples using mvn Hi Yiming, On

Re: Spark serialization issues with third-party libraries

2014-11-24 Thread jatinpreet
Thanks Arush! Your example is nice and easy to understand. I am implementing it through Java though. Jatin - Novice Big Data Programmer

Store kmeans model

2014-11-24 Thread Jaonary Rabarisoa
Dear all, how can one save a KMeans model after training? Best, Jao

Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
Hi, I want to submit my Spark program from my machine to a YARN cluster in yarn-client mode. How do I specify all the required details through the Spark submitter? Please provide me some details. -Naveen.

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Akhil Das
You can export the Hadoop configuration dir (export HADOOP_CONF_DIR=XXX) in the environment and then submit it like: ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master yarn-cluster \ # can also be `yarn-client` for client mode --executor-memory 20G \

issue while running the code in standalone mode: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-24 Thread vdiwakar.malladi
Hi, when I try to execute the program from my laptop by connecting to the HDP environment (on which Spark is also configured), I'm getting the warning (Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory) and the job is being

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Naveen Kumar Pokala
Hi Akhil, but the driver and YARN are on different networks. How do I specify the (export HADOOP_CONF_DIR=XXX) path? The driver is on my Windows machine and YARN is on a Unix machine on a different network. -Naveen. From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Monday, November 24,

Re: Submit Spark driver on Yarn Cluster in client mode

2014-11-24 Thread Akhil Das
Not sure if it will work, but you can try creating a dummy Hadoop conf directory, putting those (*-site.xml) files inside it, and hopefully Spark will pick it up and submit to that remote cluster (if there aren't any network/firewall issues). Thanks Best Regards On Mon, Nov 24, 2014 at

Re: issue while running the code in standalone mode: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-24 Thread Akhil Das
This can happen mainly because of the following: - Wrong master URL (make sure you give the master URL listed in the top left corner of the web UI, running on 8080) - Allocating more memory/cores than available when creating the SparkContext. Thanks Best Regards On Mon, Nov 24, 2014 at 4:13 PM,

Use case question

2014-11-24 Thread Gordon Benjamin
Hi, we are building an analytics dashboard. Data will be updated every 5 minutes for now and eventually every minute, maybe more frequently. The amount of data coming in is not huge, maybe 30 records per minute per customer, although we could have 500 customers. Is streaming correct for this, instead

Re: Use case question

2014-11-24 Thread Akhil Das
Streaming would be easy to implement: all you have to do is create the stream, do some transformations (depending on your use case) and finally write the results to your dashboard's backend. What kind of dashboards are you building? For d3.js-based ones, you can have a websocket and write the stream output to
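To make that concrete, here is a minimal Spark Streaming sketch along those lines; it assumes a text source on a socket and a stub pushToDashboard helper standing in for the real dashboard backend (both are illustrative, not part of the original reply):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._  // pair DStream operations

// Stub standing in for the real dashboard write (e.g. a websocket push).
def pushToDashboard(customer: String, count: Int): Unit =
  println(s"$customer -> $count")

val conf = new SparkConf().setAppName("DashboardFeed").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(60))  // one batch per minute

// Each record is assumed to be "customerId,..."; count records per customer.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.map(line => (line.split(",")(0), 1)).reduceByKey(_ + _)

// Push every batch's result to the dashboard backend.
counts.foreachRDD { rdd =>
  rdd.collect().foreach { case (customer, n) => pushToDashboard(customer, n) }
}

ssc.start()
ssc.awaitTermination()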

Re: Use case question

2014-11-24 Thread Gordon Benjamin
Thanks. Yes, d3 ones. Just to clarify: we could take our current system, which is incrementally adding partitions, and overlay a Spark Streaming layer to ingest these partitions? Then nightly we could coalesce these partitions, for example? I presume that while we are carrying out a coalesce, the

Re: issue while running the code in standalone mode: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-24 Thread Sean Owen
Wouldn't it likely be the opposite? Too much memory / too many cores being requested relative to the resource that YARN makes available? On Nov 24, 2014 11:00 AM, Akhil Das ak...@sigmoidanalytics.com wrote: This can happen mainly because of the following: - Wrong master url (Make sure you give

Re: issue while running the code in standalone mode: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-11-24 Thread vdiwakar.malladi
Thanks for your response. I gave the correct master URL. Moreover, as I mentioned in my post, I was able to run the sample program using spark-submit, but it is not working when I'm running from my machine. Any clue on this? Thanks in advance.

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread riginos
OK thank you very much for that! On 23 Nov 2014 21:49, Denny Lee [via Apache Spark User List] ml-node+s1001560n19598...@n3.nabble.com wrote: It sort of depends on your environment. If you are running on your local environment, I would just download the latest Spark 1.1 binaries and you'll be

Writing collection to file error

2014-11-24 Thread Saurabh Agrawal
import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.Rating // Load and parse the data val data = sc.textFile("/path/CFReady.txt") val ratings = data.map(_.split('\t') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)

Re: Use case question

2014-11-24 Thread Akhil Das
I'm not quite sure if I understood you correctly, but here's the thing: if you use Spark Streaming, your dashboard is refreshed for each batch. So for every batch your dashboard will be updated with the new data. And yes, the end user won't feel anything while you do the

Re: Writing collection to file error

2014-11-24 Thread Akhil Das
Hi Saurabh, here your ratesAndPreds is an RDD of type [((Int, Int), (Double, Double))], not an Array. Now, if you want to save it to disk, you can simply call saveAsTextFile and provide the location. So change your last line from this: ratesAndPreds.foreach(pw.println) to this
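In other words, something like this sketch (the output path is illustrative; saveAsTextFile writes a directory of part-* files, one per partition):

// ratesAndPreds: RDD[((Int, Int), (Double, Double))]
ratesAndPreds.saveAsTextFile("/path/CFOutput")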

Re: EC2 cluster with SSD ebs

2014-11-24 Thread Hao Ren
Hi, I found that the EC2 script has been improved a lot, and the option ebs-vol-type was added to specify the EBS type. However, it seems that the option does not work. The command I used is the following: $SPARK_HOME/ec2/spark-ec2 -k sparkcv -i spark.pem -m r3.4xlarge -s 3 -t r3.2xlarge

spark broadcast error

2014-11-24 Thread Ke Wang
I want to run my Spark program on a YARN cluster, but when I tested the broadcast function in my program, I got an error: Exception in thread main org.apache.spark.SparkException: Error sending message as driverActor is null [message = UpdateBlockInfo(BlockManagerId(driver, in160-011.byted.org,

Re: Use case question

2014-11-24 Thread Gordon Benjamin
Great thanks On Monday, November 24, 2014, Akhil Das ak...@sigmoidanalytics.com wrote: I'm not quiet sure if i understood you correctly, but here's the thing, if you use sparkstreaming, it is more likely to refresh your dashboard for each batch. So for every batch your dashboard will be

ExternalAppendOnlyMap: Thread spilling in-memory map of to disk many times slowly

2014-11-24 Thread Romi Kuntsman
Hello, I have a large data calculation in Spark, distributed across several nodes. In the end, I want to write to a single output file. For this I do: output.coalesce(1, false).saveAsTextFile(filename). What happens is all the data from the workers flows to a single worker, and that one

Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to create an uber jar to submit to the standalone master. I'm using the Hadoop 1 prebuilt binaries for Spark. As soon as I try to do sc.cassandraTable(...) I get an error that's likely to be a Guava versioning

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-24 Thread Josh Mahonin
Hi Alaa Ali, That's right, when using the PhoenixInputFormat, you can do simple 'WHERE' clauses and then perform any aggregate functions you'd like from within Spark. Any aggregations you run won't be quite as fast as running the native Spark queries, but once it's available as an RDD you can

Re: Spark SQL Programming Guide - registerTempTable Error

2014-11-24 Thread Rishi Yadav
We keep conf as a symbolic link so that an upgrade is as simple as a drop-in replacement. On Monday, November 24, 2014, riginos samarasrigi...@gmail.com wrote: OK thank you very much for that! On 23 Nov 2014 21:49, Denny Lee [via Apache Spark User List] [hidden email]

RE: ClassNotFoundException in standalone mode

2014-11-24 Thread Benoit Pasquereau
I finally managed to get the example working; here are the details that may help other users. I have 2 Windows nodes for the test system, PN01 and PN02. Both have the same shared drive S: (it is mapped to C:\source on PN02). If I run the worker and master from S:\spark-1.1.0-bin-hadoop2.4,

RE: Writing collection to file error

2014-11-24 Thread Saurabh Agrawal
Thanks for your help Akhil; however, this is creating an output folder and storing the result set in multiple files. Also, the record count in the result set seems to have multiplied!! Is there any other way to achieve this? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green

Spark SQL (1.0)

2014-11-24 Thread david
Hi, I build 2 tables from files. Table F1 is joined with table F2 on c5=d4. F1 has 46730613 rows and F2 has 3386740 rows. All d4 keys exist in F1.c5, so I expect to retrieve 46730613 rows, but it returns only 3437 rows. // --- begin code --- val sqlContext = new

Re: Converting a column to a map

2014-11-24 Thread Yanbo
jsonFiles in your code is a SchemaRDD rather than an RDD[Array]. If it is a column in the SchemaRDD, you can first use a Spark SQL query to get a certain column. Alternatively, SchemaRDD supports SQL-like operations such as select / where, which can also get a specific column. On Nov 24, 2014, at 4:01 AM, Daniel Haviv
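As an illustration, a small sketch of both approaches, assuming the Spark 1.1-era SchemaRDD API, an existing SparkContext sc, and a made-up JSON file with name and age fields:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._  // implicits for the SchemaRDD DSL

val people = sqlContext.jsonFile("people.json")  // a SchemaRDD
people.registerTempTable("people")

// 1) Plain SQL to project one column
val namesSql = sqlContext.sql("SELECT name FROM people")

// 2) The SchemaRDD DSL: select / where without writing SQL text
val namesDsl = people.where('age > 21).select('name)

namesDsl.collect().foreach(println)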

Re: Spark Cassandra Guava version issues

2014-11-24 Thread shahab
I faced the same problem, and a workaround solution is here: https://github.com/datastax/spark-cassandra-connector/issues/292 best, /Shahab On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab as...@live.com wrote: I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using sbt-assembly to

Spark and Stanford CoreNLP

2014-11-24 Thread tvas
Hello, I was wondering if anyone has gotten the Stanford CoreNLP Java library to work with Spark. My attempts to use the parser/annotator fail because of task serialization errors since the class StanfordCoreNLP cannot be serialized. I've tried the remedies of registering StanfordCoreNLP

Re: Writing collection to file error

2014-11-24 Thread Akhil Das
To get the results in a single file, you could do a repartition(1) and then save it: ratesAndPreds.repartition(1).saveAsTextFile("/path/CFOutput") Thanks Best Regards On Mon, Nov 24, 2014 at 8:32 PM, Saurabh Agrawal saurabh.agra...@markit.com wrote: Thanks for your help Akhil, however,

Re: MLLib: LinearRegressionWithSGD performance

2014-11-24 Thread Yanbo
The metrics page reveals that only two executors work in parallel for each iteration. You need to increase the number of parallel tasks. Some tips may be helpful: increase spark.default.parallelism; use repartition() or coalesce() to increase the partition number. On Nov 22, 2014, at 3:18 AM, Sameer

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan Sparks
We have gotten this to work, but it requires instantiating the CoreNLP object on the worker side. Because of the initialization time it makes a lot of sense to do this inside of a .mapPartitions instead of a .map, for example. As an aside, if you're using it from Scala, have a look at

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Ian O'Connell
object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer to it from your map/reduce/mapPartitions and it should be fine (presuming it's thread safe); it will only be initialized once per classloader per JVM. On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks
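A slightly fuller sketch of that pattern, assuming the stock StanfordCoreNLP pipeline API and an illustrative input file; the singleton is initialized lazily on each executor JVM instead of being shipped with the closure:

import java.util.Properties
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}

object NLP {
  // Built on first use in each JVM, never serialized into the task closure.
  @transient lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, pos")
    new StanfordCoreNLP(props)
  }
}

val annotated = sc.textFile("docs.txt").mapPartitions { docs =>
  docs.map { text =>
    val doc = new Annotation(text)
    NLP.pipeline.annotate(doc)   // annotates the document in place
    doc.toString
  }
}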

Re: How to keep a local variable in each cluster?

2014-11-24 Thread Yanbo
Sent from my iPad. On Nov 24, 2014, at 9:41 AM, zh8788 78343...@qq.com wrote: Hi, I am new to Spark. This is the first time I am posting here. Currently, I am trying to implement ADMM optimization algorithms for Lasso/SVM. Then I came across a problem: since the training data (label, feature) is large, I

Connected Components running for a long time and failing eventually

2014-11-24 Thread nitinkak001
I am trying to run connected components on a graph generated by reading an edge file. It runs for a long time (3-4 hrs) and then eventually fails. I can't find any error in the log file. The file I am testing it on has 27M rows (edges). Is there something obviously wrong with the code? I tested the

advantages of SparkSQL?

2014-11-24 Thread mrm
Hi, is there any advantage to storing data in Parquet format and loading it using the Spark SQL context, but never registering it as a table / using SQL on it? Something like: data = sqc.parquetFile(path) results = data.map(lambda x: applyfunc(x.field)) Is this faster/more optimised

Mllib native netlib-java/OpenBLAS

2014-11-24 Thread agg212
Hi, I'm trying to improve performance for Spark's MLlib, and I am having trouble getting the native netlib-java libraries installed/recognized by Spark. I am running on a single machine (Ubuntu 14.04) and here is what I've tried: sudo apt-get install libgfortran3 sudo apt-get install libatlas3-base

Re: advantages of SparkSQL?

2014-11-24 Thread Akshat Aranya
Parquet is a column-oriented format, which means that you need to read in less data from the file system if you're only interested in a subset of your columns. Also, Parquet pushes down selection predicates, which can eliminate needless deserialization of rows that don't match a selection

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Madabhattula Rajesh Kumar
Hello, I'm new to Stanford CoreNLP. Could anyone share good training material and examples (Java or Scala) on NLP? Regards, Rajesh On Mon, Nov 24, 2014 at 9:38 PM, Ian O'Connell i...@ianoconnell.com wrote: object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
This is probably not the right venue for general questions on CoreNLP - the project website (http://nlp.stanford.edu/software/corenlp.shtml) provides documentation and links to mailing lists/stack overflow topics. On Mon, Nov 24, 2014 at 9:08 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com

How does Spark SQL traverse the physical tree?

2014-11-24 Thread Tim Chou
Hi all, I'm learning the Spark SQL code and I'm confused about how SchemaRDD executes each operator. I'm tracing the code, and I found that the toRdd() function in QueryExecution is the starting point for running a query. The toRdd function will run the SparkPlan, which is a tree structure. However, I didn't find any

Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Hello guys, I'm using Spark 1.0.0 and Kryo serialization. In the Spark shell, when I create a class that has the SparkContext as an attribute, in this way: class AAA(val s: SparkContext) { } val aaa = new AAA(sc) and I execute any action using that attribute, like: val myNumber = 5

Spark error in execution

2014-11-24 Thread Blackeye
I created an application in Spark. When I run it with Spark, everything works fine, but when I export my application with the libraries (via sbt) and try to run it as an executable jar, I get the following error: 14/11/24 20:06:11 ERROR OneForOneStrategy: exception during creation

Re: Spark streaming job failing after some time.

2014-11-24 Thread pankaj channe
I have figured out the problem here. Turned out that there was a problem with my SparkConf when I was running my application with yarn in cluster mode. I was setting my master to be local[4] inside my application, whereas I was setting it to yarn-cluster with spark-submit. Now I have changed my

Python Scientific Libraries in Spark

2014-11-24 Thread Rohit Pujari
Hello folks: since Spark exposes Python bindings and allows you to express your logic in Python, is there a way to leverage sophisticated libraries like NumPy, SciPy, and scikit-learn in a Spark job and run at scale? What's been your experience? Any insights you can share in terms of what's

RE: Spark Cassandra Guava version issues

2014-11-24 Thread Ashic Mahtab
Did the workaround work for you? Doesn't seem to work for me. Date: Mon, 24 Nov 2014 16:44:17 +0100 Subject: Re: Spark Cassandra Guava version issues From: shahab.mok...@gmail.com To: as...@live.com CC: user@spark.apache.org I faced same problem, and s work around solution is here :

Re: Python Logistic Regression error

2014-11-24 Thread Xiangrui Meng
The data is in LIBSVM format. So this line won't work: values = [float(s) for s in line.split(' ')] Please use the util function in MLUtils to load it as an RDD of LabeledPoint. http://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point from pyspark.mllib.util import MLUtils

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Xiangrui Meng
Try building Spark with -Pnetlib-lgpl, which includes the JNI library in the Spark assembly jar. This is the simplest approach. If you want to include it as part of your project, make sure the library is inside the assembly jar or you specify it via `--jars` with spark-submit. -Xiangrui On Mon,

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
Additionally - I strongly recommend using OpenBLAS over the Atlas build from the default Ubuntu repositories. Alternatively, you can build ATLAS on the hardware you're actually going to be running the matrix ops on (the master/workers), but we've seen modest performance gains doing this vs.

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Marcelo Vanzin wrote: Do you expect to be able to use the Spark context in the remote task? Not at all; what I want to create is a wrapper of the SparkContext, to be used only on the driver node. I would like to have several attributes in this AAA wrapper, such as the SparkContext and other

Unable to use Kryo

2014-11-24 Thread Daniel Haviv
Hi, I want to test Kryo serialization, but when starting spark-shell I'm hitting the following error: java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer. The kryo-2.21.jar is on the classpath, so I'm not sure why it's not picking it up. Thanks for your help, Daniel

Re: How does Spark SQL traverse the physical tree?

2014-11-24 Thread Michael Armbrust
You are pretty close. The QueryExecution is what drives the phases from parsing to execution. Once we have a final SparkPlan (the physical plan), toRdd just calls execute() which recursively calls execute() on children until we hit a leaf operator. This gives us an RDD[Row] that will compute

Re: advantages of SparkSQL?

2014-11-24 Thread Michael Armbrust
Akshat is correct about the benefits of Parquet as a columnar format, but I'll add that some of this is lost if you just use a lambda function to process the data. Since your lambda function is a black box, Spark SQL does not know which columns it is going to use and thus will do a full table scan.
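A small sketch of the difference, with illustrative paths and field names; the SQL form lets Parquet read only the projected column, while the opaque map has to materialize whole rows:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val data = sqlContext.parquetFile("/data/events.parquet")

// Black-box lambda: Spark SQL cannot tell which columns are used,
// so every column is read and deserialized.
val doubledViaMap = data.map(row => row.getInt(0) * 2)

// Declarative query: only the `key` column needs to be read from Parquet.
data.registerTempTable("events")
val doubledViaSql = sqlContext.sql("SELECT key * 2 FROM events")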

Re: How can I read this avro file using spark scala?

2014-11-24 Thread Michael Armbrust
Thanks for the feedback, I filed a couple of issues: https://github.com/databricks/spark-avro/issues On Fri, Nov 21, 2014 at 5:04 AM, thomas j beanb...@googlemail.com wrote: I've been able to load a different avro file based on GenericRecord with: val person =

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
Hello, On Mon, Nov 24, 2014 at 12:07 PM, aecc alessandroa...@gmail.com wrote: This is the stacktrace: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA - field (class

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-24 Thread Michael Armbrust
Can you give the full stack trace. You might be hitting: https://issues.apache.org/jira/browse/SPARK-4293 On Sun, Nov 23, 2014 at 3:00 PM, critikaled isasmani@gmail.com wrote: Hi, I am trying to insert particular set of data from rdd to a hive table I have Map[String,Map[String,Int]] in

Re: Merging Parquet Files

2014-11-24 Thread Michael Armbrust
Parquet does a lot of serial metadata operations on the driver which makes it really slow when you have a very large number of files (especially if you are reading from something like S3). This is something we are aware of and that I'd really like to improve in 1.3. You might try the (brand new

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Shivaram Venkataraman
Can you clarify what Spark master URL you are using? Is it 'local' or is it a cluster? If it is 'local', then rebuilding Spark wouldn't help, as Spark is getting pulled in from Maven and that'll just pick up the released artifacts. Shivaram On Mon, Nov 24, 2014 at 1:08 PM, agg212

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
Yes, I'm running this in the shell. In my compiled jar it works perfectly; the issue is I need to do this in the shell. Any available workarounds? I checked sqlContext: they use it in the same way I would like to use my class, making the class Serializable with the context marked @transient. Does this affect

Re: Spark S3 Performance

2014-11-24 Thread Nitay Joffe
Andrei, Ashish, to be clear, I don't think it's *counting* the entire file. It just seems from the logging and the timing that it is doing a get of the entire file, then figures out it only needs certain blocks, and does another get of only the specific block. Regarding # partitions - I think I

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
On Mon, Nov 24, 2014 at 1:56 PM, aecc alessandroa...@gmail.com wrote: I checked sqlContext, they use it in the same way I would like to use my class, they make the class Serializable with transient. Does this affects somehow the whole pipeline of data moving? I mean, will I get performance

Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-24 Thread Natu Lauchande
Hi all, I am getting started with Spark. I would like to use it for a spike on anomaly detection in a massive stream of metrics. Can Spark easily handle this use case? Thanks, Natu

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread aecc
OK, great, I'm going to do it that way, thanks :). However, I still don't understand why this object should be serialized and shipped. aaa.s and sc are both the same object, org.apache.spark.SparkContext@1f222881. However, this: aaa.s.parallelize(1 to 10).filter(_ == myNumber).count needs to be
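One pattern along the lines aecc mentions earlier in the thread (SQLContext keeps the class Serializable and its SparkContext field @transient) would look like this sketch; whether it is sufficient depends on what else the shell closure captures:

import org.apache.spark.SparkContext

// Serializable wrapper with the context excluded from serialization,
// mirroring what SQLContext does with its SparkContext field.
class AAA(@transient val s: SparkContext) extends Serializable

val aaa = new AAA(sc)
val myNumber = 5
val count = aaa.s.parallelize(1 to 10).filter(_ == myNumber).count()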

Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify that it's reading the entire file on each worker using network monitoring stats? If it does, that would be a bug in my opinion. On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote: Andrei, Ashish, To be clear, I don't think it's *counting* the entire file. It

Spark SQL - Any time line to move beyond Alpha version ?

2014-11-24 Thread Manoj Samel
Is there any timeline where Spark SQL goes beyond alpha version? Thanks,

Re: Using Spark Context as an attribute of a class cannot be used

2014-11-24 Thread Marcelo Vanzin
That's an interesting question for which I do not know the answer. Probably a question for someone with more knowledge of the internals of the shell interpreter... On Mon, Nov 24, 2014 at 2:19 PM, aecc alessandroa...@gmail.com wrote: Ok, great, I'm gonna do do it that way, thanks :). However I

Re: Python Scientific Libraries in Spark

2014-11-24 Thread Davies Liu
These libraries can be used in PySpark easily. For example, MLlib uses NumPy heavily; it can accept np.array or a SciPy sparse matrix as vectors. On Mon, Nov 24, 2014 at 10:56 AM, Rohit Pujari rpuj...@hortonworks.com wrote: Hello Folks: Since spark exposes python bindings and allows you to

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Evan R. Sparks
Neat hack! This is cute and actually seems to work. The fact that it works is a little surprising and somewhat unintuitive. On Mon, Nov 24, 2014 at 8:08 AM, Ian O'Connell i...@ianoconnell.com wrote: object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer to it

Is spark streaming +MlLib for online learning?

2014-11-24 Thread Joanne Contact
Hi gurus, sorry for my naive question, I am new. I seem to have read somewhere that Spark is still batch learning, but Spark Streaming could allow online learning. I could not find this on the website now. http://spark.apache.org/docs/latest/streaming-programming-guide.html I know MLlib uses

Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Tobias Pfeiffer
Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com wrote: I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning. Spark doesn't do Machine Learning itself, but MLlib does. MLlib currently can do online learning

Re: Setup Remote HDFS for Spark

2014-11-24 Thread Tobias Pfeiffer
Hi, On Sat, Nov 22, 2014 at 12:13 AM, EH eas...@gmail.com wrote: Unfortunately whether it is possible to have both Spark and HDFS running on the same machine is not under our control. :( Right now we have Spark and HDFS running in different machines. In this case, is it still possible to

Re: Is spark streaming +MlLib for online learning?

2014-11-24 Thread Joanne Contact
Thank you Tobias! On Mon, Nov 24, 2014 at 5:13 PM, Tobias Pfeiffer t...@preferred.jp wrote: Hi, On Tue, Nov 25, 2014 at 9:40 AM, Joanne Contact joannenetw...@gmail.com wrote: I seemed to read somewhere that spark is still batch learning, but spark streaming could allow online learning.

Negative Accumulators

2014-11-24 Thread Peter Thai
Hello! Does anyone know why I may be receiving negative final accumulator values? Thanks!

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread agg212
I am running it in local mode. How can I use the built version (in local mode) so that I can use the native libraries?

Spark performance optimization examples

2014-11-24 Thread SK
Hi, Is there any document that provides some guidelines with some examples that illustrate when different performance optimizations would be useful? I am interested in knowing the guidelines for using optimizations like cache(), persist(), repartition(), coalesce(), and broadcast variables. I

Spark saveAsText file size

2014-11-24 Thread Alan Prando
Hi folks! I'm running a Spark job on a cluster with 9 slaves and 1 master (250GB RAM, 32 cores and 1TB of storage each). This job generates 1.200 TB of data in an RDD with 1200 partitions. When I call saveAsTextFile(hdfs://...), Spark creates 1200 files named part-000* in the HDFS folder.

Re: Negative Accumulators

2014-11-24 Thread Shixiong Zhu
Int overflow? If so, you can use BigInt like this: scala> import org.apache.spark.AccumulatorParam scala> :paste // Entering paste mode (ctrl-D to finish) implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] { def addInPlace(t1: BigInt,
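Pieced together, the suggestion looks roughly like this sketch (the addInPlace/zero bodies are the obvious completions of the truncated paste):

import org.apache.spark.AccumulatorParam

implicit object BigIntAccumulatorParam extends AccumulatorParam[BigInt] {
  def addInPlace(t1: BigInt, t2: BigInt): BigInt = t1 + t2
  def zero(initialValue: BigInt): BigInt = BigInt(0)
}

val acc = sc.accumulator(BigInt(0))  // picks up the implicit param above
sc.parallelize(1 to 1000000).foreach(i => acc += BigInt(i))
println(acc.value)  // no Int overflow, since the running sum is a BigInt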

Is Spark? or GraphX runs fast? a performance comparison on Page Rank

2014-11-24 Thread Harihar Nahak
Hi all, I started exploring Spark 2 months ago. I'm looking for some concrete features from both Spark and GraphX so that I can decide what to use, based on which gets the highest performance. According to the documentation, GraphX runs 10x faster than normal Spark, so I ran PageRank

Re: Negative Accumulators

2014-11-24 Thread Peter Thai
Great! Worked like a charm :) On Mon, Nov 24, 2014 at 9:56 PM, Shixiong Zhu zsxw...@gmail.com wrote: Int overflow? If so, you can use BigInt like this: scala> import org.apache.spark.AccumulatorParam scala> :paste // Entering paste mode (ctrl-D to

Re: Mllib native netlib-java/OpenBLAS

2014-11-24 Thread Evan R. Sparks
You can try recompiling Spark with that option and doing an sbt/sbt publish-local, then change your Spark version from 1.1.0 to 1.2.0-SNAPSHOT (assuming you're building from the 1.1 branch); sbt or maven (whichever you're compiling your app with) will pick up the version of Spark that you just
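For example, after sbt/sbt publish-local the application's build definition would point at the locally published version; a hypothetical build.sbt fragment (the exact version string depends on which branch was built):

// build.sbt (sketch)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.2.0-SNAPSHOT",
  "org.apache.spark" %% "spark-mllib" % "1.2.0-SNAPSHOT"
)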

Re: Spark saveAsText file size

2014-11-24 Thread Yanbo Liang
In-memory caching may blow up the size of an RDD; it is common for an RDD to take more space in memory than on disk. There are options to configure and optimize storage space efficiency in Spark; take a look at https://spark.apache.org/docs/latest/tuning.html 2014-11-25 10:38 GMT+08:00
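For example, switching the cache to a serialized storage level usually shrinks the in-memory footprint considerably, at the cost of some CPU on access; a minimal sketch with an illustrative path:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/input")

// MEMORY_ONLY stores deserialized Java objects, which can be several times
// larger than the on-disk data; MEMORY_ONLY_SER keeps each partition as a
// serialized byte array instead.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
println(rdd.count())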

Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Xuelin Cao
Hi, I'm going to debug some Spark applications on our testing platform, and it would be helpful if we could see the eventLog on the worker node. I've tried turning on spark.eventLog.enabled and setting the spark.eventLog.dir parameter on the worker node; however, it doesn't work. I do

Re: Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Marcelo Vanzin
Hello, What exactly are you trying to see? Workers don't generate any events that would be logged by enabling that config option. Workers generate logs, and those are captured and saved to disk by the cluster manager, generally, without you having to do anything. On Mon, Nov 24, 2014 at 7:46 PM,

Re: Is there a way to turn on spark eventLog on the worker node?

2014-11-24 Thread Harihar Nahak
You can set the same parameters when launching an application: if you use spark-submit, try --conf to pass those variables, or you can also set them for both driver and workers via SparkConf. - --Harihar
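For example, the event log settings are application-level configuration, passed either with --conf on spark-submit or in the SparkConf of the driver program; a sketch of the latter (the log directory is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("EventLogDemo")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")  // illustrative path

val sc = new SparkContext(conf)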

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Cheng Lian
Hm, I tried exactly the same commit and build command locally, but couldn't reproduce this. Usually this kind of error is caused by classpath misconfiguration. Could you please try this to ensure the corresponding Guava classes are included in the assembly jar you built? jar tf

Control number of parquet generated from JavaSchemaRDD

2014-11-24 Thread tridib
Hello, I am reading around 1000 input files from disk into an RDD and generating Parquet. It always produces the same number of Parquet files as input files. I tried to merge them using rdd.coalesce(n) and/or rdd.repartition(n). I also tried using: int MB_128 = 128*1024*1024;
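A minimal Scala sketch of one way to do this, assuming the Spark 1.1 SchemaRDD API (input path and target partition count are illustrative); one Parquet part file is written per partition, so shrinking the partition count before saving reduces the output file count:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

val input = sqlContext.jsonFile("/data/input/*.json")   // ~1000 small files
val merged = input.coalesce(10)                          // 10 output partitions
merged.saveAsParquetFile("/data/output.parquet")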

How to access application name in the spark framework code.

2014-11-24 Thread rapelly kartheek
Hi, When I submit a spark application like this: ./bin/spark-submit --class org.apache.spark.examples.SparkKMeans --deploy-mode client --master spark://karthik:7077 $SPARK_HOME/examples/*/scala-*/spark-examples-*.jar /k-means 4 0.001 Which part of the spark framework code deals with the name of

Re: advantages of SparkSQL?

2014-11-24 Thread Cheng Lian
For the “never register a table” part: you actually can use Spark SQL without registering a table, via its DSL. Say you're going to extract an Int field named key from the table and double it: import org.apache.spark.sql.catalyst.dsl._ val data = sqc.parquetFile(path) val double =

Re: How to access application name in the spark framework code.

2014-11-24 Thread Deng Ching-Mallete
Hi, I think it should be accessible via the SparkConf in the SparkContext. Something like sc.getConf().get("spark.app.name")? Thanks, Deng On Tue, Nov 25, 2014 at 12:40 PM, rapelly kartheek kartheek.m...@gmail.com wrote: Hi, When I submit a spark application like this: ./bin/spark-submit
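i.e., roughly this sketch from inside the running application:

// The name passed via setAppName (or --name) ends up in the SparkConf
// under the key "spark.app.name".
val appName = sc.getConf.get("spark.app.name")
println(s"Running as: $appName")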

Edge List File in GraphX

2014-11-24 Thread Deep Pradhan
Hi, is it necessary for every vertex to have an attribute when we load a graph into GraphX? In other words, I have an edge list file containing pairs of vertices, i.e., 1 2 means that there is an edge between node 1 and node 2. Now, when I run PageRank on this data it returns NaN. Can I use
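For reference, a minimal sketch of loading a plain "src dst" edge list and running PageRank with GraphLoader, which assigns a default attribute of 1 to every vertex, so the file itself does not need vertex attributes (paths are illustrative):

import org.apache.spark.graphx.GraphLoader

// Each line of edges.txt is "srcId dstId", e.g. "1 2".
val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

// Run PageRank until convergence at the given tolerance.
val ranks = graph.pageRank(0.0001).vertices
ranks.take(10).foreach { case (id, rank) => println(s"$id -> $rank") }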

Re: New Codes in GraphX

2014-11-24 Thread Deep Pradhan
Could it be because my edge list file is in the form (1 2), where there is an edge between node 1 and node 2? On Tue, Nov 18, 2014 at 4:13 PM, Ankur Dave ankurd...@gmail.com wrote: At 2014-11-18 15:51:52 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: Yes the above command works,

Re: Spark performance optimization examples

2014-11-24 Thread Akhil Das
Here are the tuning guidelines if you haven't seen them already: http://spark.apache.org/docs/latest/tuning.html You could try the following to get it loaded: - Use Kryo serialization http://spark.apache.org/docs/latest/tuning.html#data-serialization - Enable RDD compression - Set the storage level to
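For instance, the first two items translate into SparkConf settings and the storage level is chosen when persisting; a sketch (paths and names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("TunedLoad")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")   // compress serialized cached partitions

val sc = new SparkContext(conf)
val data = sc.textFile("hdfs:///data/big.txt")
data.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized (and compressed) cache
println(data.count())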

Re: How to insert complex types like mapstring,mapstring,int in spark sql

2014-11-24 Thread critikaled
Thanks for the reply Michael, here is the stack trace: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 1 times, most recent failure: Lost task 3.0 in stage 0.0 (TID 3, localhost): scala.MatchError: MapType(StringType,StringType,true) (of class

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-24 Thread Judy Nash
This is what I got from jar tf: org/spark-project/guava/common/base/Preconditions.class org/spark-project/guava/common/math/MathPreconditions.class com/clearspring/analytics/util/Preconditions.class parquet/Preconditions.class I seem to have the class that was reported missing, but I am missing this
