Re: K means clustering in spark

2015-12-31 Thread Yanbo Liang
Hi Anjali, The main output of KMeansModel is clusterCenters, which is Array[Vector]. It has k elements, where k is the number of clusters and each element is the center of the corresponding cluster. Yanbo 2015-12-31 12:52 GMT+08:00 : > Hi, > > I am trying to use kmeans
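A minimal sketch of training and reading the centers, assuming the MLlib API of that era, an existing SparkContext sc, and a hypothetical space-delimited input file:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // hypothetical input: one space-separated vector per line
    val data = sc.textFile("data/points.txt")
      .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

    val model = KMeans.train(data, 3, 20) // k = 3 clusters, 20 iterations

    // clusterCenters is Array[Vector]: one entry per cluster
    model.clusterCenters.zipWithIndex.foreach { case (center, i) =>
      println(s"center of cluster $i: $center")
    }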

Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Hi all, I'm sure this must have been solved already, but I can't see anything obvious. Using Spark Streaming, I'm trying to execute a transform function on a DStream at short batch intervals (e.g. 1 second), but only write the resulting data to disk using saveAsTextFiles in a larger batch
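One possible shape for this, sketched under the assumption that a 60-second window over the 1-second batches is acceptable and that lines is the input DStream (transformOne is hypothetical); the per-batch action is needed because transformations alone don't trigger execution:

    import org.apache.spark.streaming.Seconds

    val transformed = lines.map(transformOne)
    transformed.cache()

    // force the time-sensitive work to run every 1-second batch
    transformed.foreachRDD(rdd => rdd.count())

    // write to disk only once per 60-second window
    transformed.window(Seconds(60), Seconds(60))
      .saveAsTextFiles("hdfs:///output/batched")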

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread Steve Loughran
> On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev > wrote: > > Hi Jerry, > > I want to run different jobs on different S3 buckets - different AWS creds - > on the same instances. Could you shed some light on whether it's possible to > achieve this with hdfs-site? > >

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-31 Thread Jia Zou
Thanks, Yanbo. The results become much more reasonable after I set driver memory to 5GB and increase worker memory to 25GB. So my question is: for the following code snippet, extracted from the main method of JavaKMeans.java in the examples, what will the driver do, and what will the workers do? I didn't

Re: Help me! Spark WebUI is corrupted!

2015-12-31 Thread Aniket Bhatnagar
Are you running on YARN or standalone? On Thu, Dec 31, 2015, 3:35 PM LinChen wrote: > *Screenshot1(Normal WebUI)* > > > > *Screenshot2(Corrupted WebUI)* > > > > As screenshot2 shows, the format of my Spark WebUI looks strange and I > cannot click the description of active

what is the proper number set about --num-executors etc

2015-12-31 Thread Zhiliang Zhu
In order to make jobs run faster, some parameters can be specified on the command line, such as --executor-cores, --executor-memory and --num-executors... However, as tested, it seems those numbers cannot be set arbitrarily, or trouble is caused for the cluster. What is
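The same knobs can also be set programmatically, which makes it easier to see what the flags map to; a sketch with illustrative values only, since the right numbers depend on the cluster's cores and memory:

    import org.apache.spark.SparkConf

    // equivalents of --executor-cores, --executor-memory, --num-executors
    val conf = new SparkConf()
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")
      .set("spark.executor.instances", "10") // honored on YARN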

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ewan Leith
Yeah, it's awkward: the transforms being done are fairly time-sensitive, so I don't want them to wait 60 seconds or more. I might have to move the code from a transform into a custom receiver instead, so they'll be processed outside the window length. A buffered writer is a good idea too,

Re: Monitoring Spark HDFS Reads and Writes

2015-12-31 Thread Steve Loughran
> On 30 Dec 2015, at 13:19, alvarobrandon wrote: > > Hello: > > Is there any way of monitoring the number of bytes or blocks read and written > by a Spark application? I'm running Spark with YARN and I want to measure > how I/O-intensive a set of applications are.

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread Brian London
Since you're running in standalone mode, can you try it using Spark 1.5.1 please? On Thu, Dec 31, 2015 at 9:09 AM Steve Loughran wrote: > > > On 30 Dec 2015, at 19:31, KOSTIANTYN Kudriavtsev < > kudryavtsev.konstan...@gmail.com> wrote: > > > > Hi Jerry, > > > > I want to

Help me! Spark WebUI is corrupted!

2015-12-31 Thread LinChen
Screenshot1(Normal WebUI) Screenshot2(Corrupted WebUI) As screenshot2 shows, the format of my Spark WebUI looks strange and I cannot click the description of active jobs. It seems there is something missing in my operating system. I googled it but found nothing. Could anybody help me?

RE: Batch together RDDs for Streaming output, without delaying execution of map or transform functions

2015-12-31 Thread Ashic Mahtab
Hi Ewan, Transforms are definitions of what needs to be done - they don't execute until an action is triggered. For what you want, I think you might need an action that writes out RDDs to some sort of buffered writer. -Ashic. From: ewan.le...@realitymine.com To: user@spark.apache.org
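A minimal sketch of that suggestion, assuming transformed is a DStream[String] and the per-batch results are small enough to collect to the driver:

    import java.io.{BufferedWriter, FileWriter}

    // lives on the driver, so it doesn't need to be serializable
    val writer = new BufferedWriter(new FileWriter("/tmp/stream-output.txt", true))

    transformed.foreachRDD { rdd =>
      // collect() is the action that forces each batch's transforms to run
      rdd.collect().foreach(line => writer.write(line + "\n"))
      writer.flush()
    }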

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi Jerry, thanks for the hint. Could you please be more specific about how I can pass a different spark-{usr}.conf per user during job submit, and which property I can use to specify a custom hdfs-site.xml? I tried to google it, but didn't find anything. Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at

pass custom spark-conf

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi all, I'm trying to use a different spark-default.conf per user, i.e. I want to have spark-user1.conf etc. Is there a way to pass a path to the appropriate conf file when I'm using a standalone spark installation? Also, is it possible to configure a different hdfs-site.xml and pass it as well with

Re: efficient checking the existence of an item in a rdd

2015-12-31 Thread domibd
thanks a lot. It is very interesting. Unfortunately it does not solve my very simple problem: efficiently finding whether a value is in a huge RDD. thanks again Dominique On 31/12/2015 01:26, madaan.amanmadaan [via Apache Spark User List] wrote: > Hi, > > Check out

Re: efficient checking the existence of an item in a rdd

2015-12-31 Thread Nick Peterson
The key to efficient lookups is having a partitioner in place. If you don't have a partitioner in place, essentially the best you can do is: def contains[T](rdd: RDD[T], value: T): Boolean = ! (rdd.filter(x => x == value).isEmpty) If you are going to do this sort of operation frequently, it
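A sketch of the partitioned variant, keying the RDD so that lookup only has to scan the single partition the key hashes to (the partition count is illustrative):

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    // pay the shuffle cost once, then reuse the table for many lookups
    def asLookupTable[T: ClassTag](rdd: RDD[T], partitions: Int): RDD[(T, Unit)] =
      rdd.map(v => (v, ())).partitionBy(new HashPartitioner(partitions)).cache()

    // lookup uses the partitioner to touch only one partition
    def contains[T: ClassTag](table: RDD[(T, Unit)], value: T): Boolean =
      table.lookup(value).nonEmpty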

Re: Apparent bug in KryoSerializer

2015-12-31 Thread Ted Yu
For your second question, bq. Class is not registered: scala.Tuple3[] The above IllegalArgumentException states the class for which Kryo expects registration, meaning the types of the components in the tuple are insignificant. BTW what Spark release are you using? Cheers On Thu, Dec 31, 2015 at
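A sketch of the registration this points at, assuming spark.kryo.registrationRequired is on; because of type erasure one Tuple3 entry covers all component types, and the trailing [] in the error message suggests the array class needs registering as well:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .registerKryoClasses(Array(
        classOf[Tuple3[_, _, _]],        // erased, so one entry covers all element types
        classOf[Array[Tuple3[_, _, _]]]  // "scala.Tuple3[]" in the error is the array class
      ))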

Re: SparkSQL integration issue with AWS S3a

2015-12-31 Thread KOSTIANTYN Kudriavtsev
Hi Jerry, what you suggested seems to be working (I put hdfs-site.xml into the $SPARK_HOME/conf folder), but could you shed some light on how it can be federated per user? Thanks in advance! Thank you, Konstantin Kudryavtsev On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam wrote:

Problem embedding GaussianMixtureModel in a closure

2015-12-31 Thread Tomasz Fruboes
Dear All, I'm trying to implement a procedure that iteratively updates an RDD using results from GaussianMixtureModel.predictSoft. In order to avoid problems with the local variable (the obtained GMM) being overwritten in each pass of the loop, I'm doing the following:
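Without the full code it's hard to be precise, but the usual shape of that workaround is to bind the model to a fresh local val inside each iteration so every closure captures its own reference; a rough sketch in which initial, numIterations, k and update are all hypothetical:

    import org.apache.spark.mllib.clustering.GaussianMixture
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    var current: RDD[Vector] = initial
    for (_ <- 1 to numIterations) {
      val gmm = new GaussianMixture().setK(k).run(current) // fresh val each pass
      val soft: RDD[Array[Double]] = gmm.predictSoft(current) // RDD overload avoids closing over the model
      current = current.zip(soft).map { case (v, weights) => update(v, weights) }
      current.cache().count() // materialize so later passes don't recompute the whole lineage
    }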

Re: pass custom spark-conf

2015-12-31 Thread KOSTIANTYN Kudriavtsev
I want to add AWS credentials into hdfs-site.xml and pass a different xml for each user Thank you, Konstantin Kudryavtsev On Thu, Dec 31, 2015 at 2:19 PM, Ted Yu wrote: > Check out --conf option for spark-submit > > bq. to configure different hdfs-site.xml > > What
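One way to get the same effect without maintaining separate hdfs-site.xml files is the spark.hadoop.* prefix, which Spark copies into the Hadoop Configuration at submit time; a sketch assuming S3a and hypothetical per-user key variables:

    import org.apache.spark.SparkConf

    // anything under spark.hadoop.* lands in the Hadoop Configuration,
    // so each user's submit can carry its own credentials
    val conf = new SparkConf()
      .set("spark.hadoop.fs.s3a.access.key", userAccessKey) // hypothetical
      .set("spark.hadoop.fs.s3a.secret.key", userSecretKey) // hypothetical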

Re: Monitoring Spark HDFS Reads and Writes

2015-12-31 Thread Arkadiusz Bicz
Hello, Spark collects HDFS read/write metrics per application/job; see details at http://spark.apache.org/docs/latest/monitoring.html. I have connected Spark metrics to Graphite and then display nice graphs in Grafana. BR, Arek On Thu, Dec 31, 2015 at 2:00 PM, Steve Loughran

Re: pass custom spark-conf

2015-12-31 Thread Ted Yu
Check out the --conf option for spark-submit bq. to configure different hdfs-site.xml What config parameters do you plan to change in hdfs-site.xml? If the parameter only affects the HDFS NN / DN, passing hdfs-site.xml wouldn't take effect, right? Cheers On Thu, Dec 31, 2015 at 10:48 AM, KOSTIANTYN

Apparent bug in KryoSerializer

2015-12-31 Thread Russ
The ScalaTest code that is enclosed at the end of this email message demonstrates what appears to be a bug in the KryoSerializer. This code was executed from IntelliJ IDEA (community edition) under Mac OS X 10.11.2. The KryoSerializer is enabled by updating the original SparkContext (that is

does HashingTF maintain an inverse index?

2015-12-31 Thread Andy Davidson
Hi I am working on a proof of concept. I am trying to use Spark to classify some documents. I am using Tokenizer and HashingTF to convert the documents into vectors. Is there any easy way to map features back to words, or do I need to maintain the reverse index myself? I realize there is a chance
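HashingTF itself is one-way, since hash collisions make an exact inverse impossible. One alternative worth noting (not from this thread) is CountVectorizer, available from Spark 1.5, which keeps an explicit vocabulary; a sketch assuming a DataFrame df with a tokenized words column:

    import org.apache.spark.ml.feature.CountVectorizer

    val cvModel = new CountVectorizer()
      .setInputCol("words")
      .setOutputCol("features")
      .fit(df) // df is assumed to hold tokenized documents

    // vector index i maps back to cvModel.vocabulary(i)
    val indexToWord: Array[String] = cvModel.vocabulary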

Re: does HashingTF maintain an inverse index?

2015-12-31 Thread Hayri Volkan Agun
Hi, If you are using the pipeline API, you do not need to map features back to documents. Your input (which is the document text) won't change after you use HashingTF. If you want to do Information Retrieval with Spark, I suggest you use RDDs rather than the pipeline... On Fri, Jan 1, 2016 at 2:20