Re: Using DIMSUM with ids

2015-04-06 Thread Reza Zadeh
Right now dimsum is meant to be used for tall and skinny matrices, and so columnSimilarities() returns similar columns, not rows. We are working on adding an efficient row similarity as well, tracked by this JIRA: https://issues.apache.org/jira/browse/SPARK-4823 Reza On Mon, Apr 6, 2015 at 6:08
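For readers landing on this thread, a minimal sketch of the column-similarity call Reza mentions; the matrix contents and threshold are made-up values, and `sc` is assumed to be an existing SparkContext:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each vector is one row of a tall-and-skinny matrix.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(4.0, 5.0, 0.0),
  Vectors.dense(7.0, 0.0, 9.0)
))
val mat = new RowMatrix(rows)

// DIMSUM estimates cosine similarities between COLUMNS; pairs whose
// similarity is likely below the threshold are dropped.
val sims = mat.columnSimilarities(0.1)   // CoordinateMatrix of (i, j, similarity)
sims.entries.take(5).foreach(println)
```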

Re: Spark Average

2015-04-06 Thread baris akgun
Thanks for your replies. I solved the problem with this code: val weathersRDD = sc.textFile(csvfilePath).map { line => val Array(dayOfdate, minDeg, maxDeg, meanDeg) = line.replaceAll("\"","").trim.split(",") Tuple2(dayOfdate.substring(0,7), (minDeg.toInt, maxDeg.toInt, meanDeg.toInt))

Re: Spark unit test fails

2015-04-06 Thread Manas Kar
Trying to bump up the rank of the question. Can someone point to an example on GitHub? ..Manas On Fri, Apr 3, 2015 at 9:39 AM, manasdebashiskar manasdebashis...@gmail.com wrote: Hi experts, I am trying to write unit tests for my spark application which fails with
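Not the GitHub example Manas is asking for, but a minimal sketch of the pattern many projects use for Spark unit tests (names are hypothetical; assumes ScalaTest is on the test classpath):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSuite extends FunSuite with BeforeAndAfterAll {
  @transient private var sc: SparkContext = _

  override def beforeAll(): Unit = {
    // A local master keeps the test self-contained; no cluster is needed.
    sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("unit-test"))
  }

  override def afterAll(): Unit = {
    if (sc != null) sc.stop()
  }

  test("counts words") {
    val counts = sc.parallelize(Seq("a b", "a")).flatMap(_.split(" ")).countByValue()
    assert(counts("a") === 2L)
  }
}
```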

Re: From DataFrame to LabeledPoint

2015-04-06 Thread Joseph Bradley
I'd make sure you're selecting the correct columns. If not that, then your input data might be corrupt. CCing user to keep it on the user list. On Mon, Apr 6, 2015 at 6:53 AM, Sergio Jiménez Barrio drarse.a...@gmail.com wrote: Hi!, I had tried your solution, and I saw that the first row is
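A sketch of the kind of column selection Joseph is describing, going from a DataFrame to RDD[LabeledPoint]; the column names are hypothetical and the label and features are assumed to be numeric already:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// df is assumed to have a numeric "label" column and numeric feature columns.
val labeled = df.select("label", "f1", "f2").map { row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), row.getDouble(2)))
}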

RE: Spark Average

2015-04-06 Thread Cheng, Hao
The DataFrame API should be perfectly helpful in this case. https://spark.apache.org/docs/1.3.0/sql-programming-guide.html A code snippet will look like: val sqlContext = new org.apache.spark.sql.SQLContext(sc) // this is used to implicitly convert an RDD to a DataFrame. import
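A sketch of how the averaging could look with the 1.3 DataFrame API, illustrating the approach Hao outlines; the case class is adapted from the thread with the month pre-extracted, and `sc` is assumed to exist:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

case class WeatherCond(month: String, minDeg: Int, maxDeg: Int, meanDeg: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._          // enables rdd.toDF()

// weathersRDD is assumed to be an RDD[WeatherCond] parsed from the CSV file.
val monthlyAvg = weathersRDD.toDF()
  .groupBy("month")
  .agg(avg("minDeg"), avg("maxDeg"), avg("meanDeg"))
monthlyAvg.show()
```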

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
Please attach the full stack trace. -Xiangrui On Mon, Apr 6, 2015 at 12:06 PM, Jay Katukuri jkatuk...@apple.com wrote: Hi all, I got a runtime error while running the ALS. Exception in thread main java.lang.NoSuchMethodError:

How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
I have created a Custom Receiver to fetch records pertaining to a specific query from Elastic Search and have implemented Streaming RDD transformations to process the data generated by the receiver. The final RDD is a sorted list of name value pairs and I want to read the top 20 results

Re: Spark + Kinesis

2015-04-06 Thread Tathagata Das
Cc'ing Chris Fregly, who wrote the Kinesis integration. Maybe he can help. On Mon, Apr 6, 2015 at 9:23 AM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Hi all, I am wondering, has anyone on this list been able to successfully implement Spark on top of Kinesis? Best, Vadim On

Re: WordCount example

2015-04-06 Thread Tathagata Das
There are no workers registered with the Spark Standalone master! That is the crux of the problem. :) Follow the instructions properly - https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts In particular, make sure the conf/slaves file has the intended workers listed. TD On Mon,

Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Todd Nist
In 1.2.1 I was persisting a set of parquet files as a table for use by the spark-sql CLI later on. There was a post here http://apache-spark-user-list.1001560.n3.nabble.com/persist-table-schema-in-spark-sql-tt16297.html#a16311 by Michael Armbrust that provided a nice little helper method for dealing

task not serialize

2015-04-06 Thread Jeetendra Gangele
In the foreach in this code I am getting a task not serializable exception: @SuppressWarnings("serial") public static void matchAndMerge(JavaRDD<VendorRecord> matchRdd, final JavaSparkContext jsc) throws IOException { log.info("Company matcher started"); //final JavaSparkContext jsc = getSparkContext();

Re: DataFrame groupBy MapType

2015-04-06 Thread Michael Armbrust
In HiveQL, you should be able to express this as: SELECT ... FROM table GROUP BY m['SomeKey'] On Sat, Apr 4, 2015 at 5:25 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I have a case class like this: case class A( m: Map[Long, Long], ... ) and constructed a DataFrame from
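A sketch of how that query could be issued from Scala; the table and DataFrame names stand in for Justin's, and the DataFrame is assumed to be built through a HiveContext so HiveQL can see the temp table:

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// rddOfA is assumed to be the RDD[A] from the thread, with the map column `m`.
val df = hiveContext.createDataFrame(rddOfA)
df.registerTempTable("events")

hiveContext.sql(
  "SELECT m['SomeKey'], count(*) FROM events GROUP BY m['SomeKey']").show()
```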

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
Thanks. I’ll look into it. But the JSON string I push via receiver goes through a series of transformations, before it ends up in the final RDD. I need to take care to ensure that this magic value propagates all the way down to the last one that I’m iterating on. Currently, I’m calling “stop

Re: DataFrame groupBy MapType

2015-04-06 Thread Michael Armbrust
I'll add that I don't think there is a convenient way to do this in the Column API ATM, but would welcome a JIRA for adding it :) On Mon, Apr 6, 2015 at 1:45 PM, Michael Armbrust mich...@databricks.com wrote: In HiveQL, you should be able to express this as: SELECT ... FROM table GROUP BY

java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Joanne Contact
Hello Sparkers, I kept getting this error: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint I have tried the following to convert v._1 to double: Method 1: (if(v._10) 1d else 0d) Method 2: def bool2Double(b:Boolean): Double = { if

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Michael Malak
You could have your receiver send a magic value when it is done. I discuss this Spark Streaming pattern in my presentation Spark Gotchas and Anti-Patterns. In the PDF version, it's slides 34-36. http://www.datascienceassn.org/content/2014-11-05-spark-gotchas-and-anti-patterns-julia-language

Re: Strategy regarding maximum number of executor failures for long-running jobs / Spark streaming jobs

2015-04-06 Thread Sandy Ryza
What's the advantage of killing an application for lack of resources? I think the rationale behind killing an app based on executor failures is that, if we see a lot of them in a short span of time, it means there's probably something going wrong in the app or on the cluster. On Wed, Apr 1, 2015

Re: Spark SQL Parquet as External table - 1.3.x HiveMetastoreType now hidden

2015-04-06 Thread Michael Armbrust
Hey Todd, "In migrating to 1.3.x I see that the spark.sql.hive.convertMetastoreParquet is no longer public, so the above no longer works." This was probably just a typo, but to be clear, spark.sql.hive.convertMetastoreParquet is still a supported option and should work. You are correct that
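For anyone searching later, setting the option Michael refers to looks like this (a sketch assuming a HiveContext named hiveContext; either form should work):

```scala
// Programmatic form:
hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")

// Or from SQL / the spark-sql CLI:
hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=true")
```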

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
Here is the command that I have used: spark-submit --class packagename.ALSNew --num-executors 100 --master yarn ALSNew.jar -jar spark-sql_2.11-1.3.0.jar hdfs://input_path Btw - I could run the old ALS in mllib package. On Apr 6, 2015, at 12:32 PM, Xiangrui Meng men...@gmail.com wrote:

Re: java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2015-04-06 Thread Jeetendra Gangele
I hit the same issue again. This time I tried to return the object and it failed with task not serializable. Below is the code; here VendorRecord is serializable: private static JavaRDD<VendorRecord> getVendorDataToProcess(JavaSparkContext sc) throws IOException { return sc

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Hari Polisetty
Yes, I’m using updateStateByKey and it works. But then I need to perform further computation on this Stateful RDD (see code snippet below). I perform forEach on the final RDD and get the top 10 records. I just don’t want the foreach to be performed every time a new batch is received. Only when

Re: java.lang.ClassCastException: scala.Tuple2 cannot be cast to org.apache.spark.mllib.regression.LabeledPoint

2015-04-06 Thread Xiangrui Meng
Did you try to treat RDD[(Double, Vector)] as RDD[LabeledPoint]? If that is the case, you need to cast them explicitly: rdd.map { case (label, features) => LabeledPoint(label, features) } -Xiangrui On Mon, Apr 6, 2015 at 11:59 AM, Joanne Contact joannenetw...@gmail.com wrote: Hello Sparkers,
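Spelled out with imports, Xiangrui's suggestion is just an explicit map over the pairs (the RDD name here is a placeholder):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// pairs is assumed to be the RDD[(Double, Vector)] from the question.
val labeled: RDD[LabeledPoint] = pairs.map { case (label, features) =>
  LabeledPoint(label, features)
}
```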

org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
Hi all, I got a runtime error while running the ALS. Exception in thread main java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; The error that I am getting is at the following code: val ratings =

Re: DataFrame -- help with encoding factor variables

2015-04-06 Thread Xiangrui Meng
Before OneHotEncoder or LabelIndexer is merged, you can define a UDF to do the mapping. val labelToIndex = udf { ... } featureDF.withColumn("f3_dummy", labelToIndex(col("f3"))) See instructions here
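A filled-in version of that sketch, with a made-up mapping for a gender-like factor column f3; the actual values and column names depend on the data:

```scala
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical factor-to-index mapping.
val labelToIndex = udf { (label: String) =>
  label match {
    case "male"   => 0.0
    case "female" => 1.0
    case _        => 2.0
  }
}

val encoded = featureDF.withColumn("f3_dummy", labelToIndex(col("f3")))
```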

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
Hi, Here is the stack trace: Exception in thread main java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror; at ALSNew$.main(ALSNew.scala:35) at ALSNew.main(ALSNew.scala) at

Re: How to work with sparse data in Python?

2015-04-06 Thread Xiangrui Meng
We support sparse vectors in MLlib, which recognizes MLlib's sparse vector and SciPy's csc_matrix with a single column. You can create an RDD of sparse vectors for your data and save/load them to/from Parquet format using DataFrames. Sparse matrix support will be added in 1.4. -Xiangrui On Mon,

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Xiangrui Meng
So ALSNew.scala is your own application; did you add it with spark-submit or spark-shell? The correct command should look like: spark-submit --class your.package.name.ALSNew ALSNew.jar [options] Please check the documentation: http://spark.apache.org/docs/latest/submitting-applications.html -Xiangrui

Re: How to restrict foreach on a streaming RDD only once upon receiver completion

2015-04-06 Thread Tathagata Das
So you want to sort based on the total count of all the records received through the receiver? In that case, you have to combine all the counts using updateStateByKey (
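A minimal updateStateByKey sketch of what TD is describing; the DStream, key/value types, and checkpoint path are assumptions, not code from the thread, and `ssc` is assumed to be the StreamingContext:

```scala
// pairs is assumed to be a DStream[(String, Int)] of (name, count) records
// produced from the receiver's output.
ssc.checkpoint("/tmp/streaming-checkpoint")   // updateStateByKey requires checkpointing

val updateTotals = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))

val runningTotals = pairs.updateStateByKey(updateTotals)
runningTotals.print()
```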

Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
Thanks a lot. That means Spark does not support nested RDDs? If I pass the JavaSparkContext that also won't work; I mean passing the SparkContext is not possible since it's not serializable. I have a requirement where I will get a JavaRDD<VendorRecord> matchRdd and I need to return the potential matches for

Re: task not serialize

2015-04-06 Thread Dean Wampler
On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks a lot. That means Spark does not support nested RDDs? If I pass the JavaSparkContext that also won't work. I mean passing SparkContext is not possible since it's not serializable That's right. RDDs don't nest

Super slow caching in 1.3?

2015-04-06 Thread Christian Perez
Hi all, Has anyone else noticed very slow time to cache a Parquet file? It takes 14 s per 235 MB (1 block) uncompressed node local Parquet file on M2 EC2 instances. Or are my expectations way off... Cheers, Christian -- Christian Perez Silicon Valley Data Science Data Analyst

Re: Spark Druid integration

2015-04-06 Thread Michael Armbrust
You could certainly build a connector, but it seems like you would want support for pushing down aggregations to get the benefits of Druid. There are only experimental interfaces for doing so today, but it sounds like a pretty cool project. On Mon, Apr 6, 2015 at 2:23 PM, Paolo Platter

Processing Large Images in Spark?

2015-04-06 Thread Patrick Young
Hi all, I'm new to Spark and wondering if it's appropriate to use for some image processing tasks on pretty sizable (~1 GB) images. Here is an example use case. Amazon recently put the entire Landsat8 archive in S3: http://aws.amazon.com/public-data-sets/landsat/ I have a bunch of GDAL based
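One option worth mentioning for whole-file inputs is sc.binaryFiles, which hands each file to a task as a byte stream; the bucket path and the per-image step below are placeholders, not a GDAL integration:

```scala
// Each element is (path, PortableDataStream); one whole file per record.
val images = sc.binaryFiles("s3n://my-bucket/landsat/*.TIF")

val sizes = images.map { case (path, stream) =>
  val bytes = stream.toArray()   // pulls the whole image into memory on the executor
  (path, bytes.length)           // stand-in for real image processing
}
sizes.take(5).foreach(println)
```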

Re: Spark SQL code generation

2015-04-06 Thread Akshat Aranya
Thanks for the info, Michael. Is there a reason to do so, as opposed to shipping out the bytecode and loading it via the classloader? Is it more complex? I can imagine caching to be effective for repeated queries, but not when the subsequent queries are different. On Mon, Apr 6, 2015 at 2:41 PM,

SparkSQL + Parquet performance

2015-04-06 Thread Paolo Platter
Hi all, is there anyone using Spark SQL + Parquet who has benchmarked storing Parquet files on HDFS versus CFS (Cassandra File System)? Which storage can improve performance of Spark SQL + Parquet? Thanks Paolo

Spark SQL code generation

2015-04-06 Thread Akshat Aranya
Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw that an expression is parsed and compiled into a class using Scala reflection toolbox. However, it's unclear to me whether the actual byte code is generated on the master or on each of the

Re: task not serialize

2015-04-06 Thread Jeetendra Gangele
On 7 April 2015 at 04:03, Dean Wampler deanwamp...@gmail.com wrote: On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks a lot. That means Spark does not support nested RDDs? If I pass the JavaSparkContext that also won't work. I mean passing SparkContext

Re: Spark SQL code generation

2015-04-06 Thread Michael Armbrust
The compilation happens in parallel on all of the machines, so it's not really clear that there is a win to generating it on the driver and shipping it from a latency perspective. However, really I just took the easiest path that didn't require more bytecode extracting / shipping machinery. On

Spark Druid integration

2015-04-06 Thread Paolo Platter
Hi, Do you think it is possible to build an integration between Druid and Spark, using the Datasource API? Is someone investigating this kind of solution? I think that Spark SQL could fill the lack of a complete SQL layer in Druid. It could be a great OLAP solution. WDYT? Paolo Platter

Re: Spark SQL code generation

2015-04-06 Thread Michael Armbrust
It is generated and cached on each of the executors. On Mon, Apr 6, 2015 at 2:32 PM, Akshat Aranya aara...@gmail.com wrote: Hi, I'm curious as to how Spark does code generation for SQL queries. Following through the code, I saw that an expression is parsed and compiled into a class using

Re: task not serialize

2015-04-06 Thread Dean Wampler
The log instance won't be serializable, because it will have a file handle to write to. Try defining another static method outside matchAndMerge that encapsulates the call to log.error. CompanyMatcherHelper might not be serializable either, but you didn't provide it. If it holds a database
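A Scala sketch of the pattern Dean describes (the thread's code is Java, but the idea is the same): keep the logger in a separate static holder so it is created on the executor instead of being dragged into the serialized closure. The object and RDD names are made up:

```scala
import org.apache.log4j.Logger

object MatchLogging extends Serializable {
  // @transient lazy val: the logger is re-created on each executor
  // instead of being serialized with the task.
  @transient lazy val log = Logger.getLogger("matcher")
  def info(msg: String): Unit = log.info(msg)
}

matchRdd.foreach { record =>
  MatchLogging.info("matching " + record)   // only `record` is captured by the closure
}
```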

Re: Super slow caching in 1.3?

2015-04-06 Thread Michael Armbrust
Do you think you are seeing a regression from 1.2? Also, are you caching nested data or flat rows? The in-memory caching is not really designed for nested data and so performs pretty slowly here (it's just falling back to Kryo and even then there are some locking issues). If so, would it be

RE: Tableau + Spark SQL Thrift Server + Cassandra

2015-04-06 Thread Mohammed Guller
Sure, will do. I may not be able to get to it until next week, but will let you know if I am able to crack the code. Mohammed From: Todd Nist [mailto:tsind...@gmail.com] Sent: Friday, April 3, 2015 5:52 PM To: Mohammed Guller Cc: pawan kumar; user@spark.apache.org Subject: Re: Tableau +

Seeing message about receiver not being de-registered on invoking Streaming context stop

2015-04-06 Thread Hari Polisetty
My application is running Spark in local mode and I have a Spark Streaming Listener as well as a Custom Receiver. When the receiver is done fetching all documents, it invokes “stop” on itself. I see the StreamingListener getting a callback on “onReceiverStopped” where I stop the streaming

Re: A problem with Spark 1.3 artifacts

2015-04-06 Thread Josh Rosen
My hunch is that this behavior was introduced by a patch to start shading Jetty in Spark 1.3: https://issues.apache.org/jira/browse/SPARK-3996. Note that Spark's *MetricsSystem* class is marked as *private[spark]* and thus isn't intended to be interacted with directly by users. It's not super

Microsoft SQL jdbc support from spark sql

2015-04-06 Thread bipin
Hi, I am trying to pull data from MS SQL Server. I have tried using the spark.sql.jdbc data source: CREATE TEMPORARY TABLE c USING org.apache.spark.sql.jdbc OPTIONS ( url jdbc:sqlserver://10.1.0.12:1433\;databaseName=dbname\;, dbtable Customer ); But it shows java.sql.SQLException: No suitable driver found
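A guess at what the statement looks like with the string literals that the archive tends to strip, issued through a SQLContext; the connection details are taken from the message as placeholders. The SQL Server JDBC driver jar also needs to be on the classpath (e.g. via --jars / --driver-class-path), otherwise "No suitable driver found" is expected:

```scala
sqlContext.sql("""
  CREATE TEMPORARY TABLE c
  USING org.apache.spark.sql.jdbc
  OPTIONS (
    url "jdbc:sqlserver://10.1.0.12:1433;databaseName=dbname",
    dbtable "Customer"
  )
""")
```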

graphx running time

2015-04-06 Thread daze5112
Hi, I'm currently using GraphX for some analysis and have run into a bit of a hurdle. If I use my test dataset of 20 nodes and about 30 links it runs really quickly. I have two other datasets I use, one of 10 million links and one of 20 million. When I create my graphs it seems to work okay and I can get

Re: Spark 1.2.0 with Play/Activator

2015-04-06 Thread andy petrella
Hello Manish, you can take a look at the spark-notebook build; it's a bit tricky to get rid of some clashes, but at least you can refer to this build for ideas. LSS, I have stripped out Akka from the Play dependencies. ref: https://github.com/andypetrella/spark-notebook/blob/master/build.sbt

Re: Sending RDD object over the network

2015-04-06 Thread Akhil Das
Are you expecting to receive the 1 to 100 values in your second program? An RDD is just an abstraction; you would need to do something like: num.foreach(x => send(x)) Thanks Best Regards On Mon, Apr 6, 2015 at 1:56 AM, raggy raghav0110...@gmail.com wrote: For a class project, I am trying to utilize 2 spark

Re: Sending RDD object over the network

2015-04-06 Thread Raghav Shankar
Hey Akhil, Thanks for your response! No, I am not expecting to receive the values themselves. I am just trying to receive the RDD object on my second Spark application. However, I get a NPE when I try to use the object within my second program. Would you know how I can properly send the RDD

Re: Low resource when upgrading from 1.1.0 to 1.3.0

2015-04-06 Thread Roy.Wang
I also met the same problem. I deploy and run Spark (version 1.3.0) in local mode. When I run a simple app that counts lines of a file, the console prints TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient

Re: Write to Parquet File in Python

2015-04-06 Thread Akriti23
Thank you so much for your reply. We would like to provide a tool for the user to convert a binary file to a file in Avro/Parquet format on their own computer. The tool will parse the binary file in Python and convert the data to Parquet. (BTW, can we append to a Parquet file?) The issue is that we do not

Re: Learning Spark

2015-04-06 Thread Akhil Das
We had few sessions at Sigmoid, you could go through the meetup page for details: http://www.meetup.com/Real-Time-Data-Processing-and-Cloud-Computing/ On 6 Apr 2015 18:01, Abhideep Chakravarty abhideep.chakrava...@mindtree.com wrote: Hi all, We are here planning to setup a Spark learning

Re: (send this email to subscribe)

2015-04-06 Thread Ted Yu
Please send email to user-subscr...@spark.apache.org On Mon, Apr 6, 2015 at 6:52 AM, 林晨 bewit...@gmail.com wrote:

RDD generated on every query

2015-04-06 Thread Siddharth Ubale
Hi, In a Spark web application the RDD is generated every time the client sends a query request. Is there any way the RDD can be built once and then queried again and again on the active SparkContext? Thanks, Siddharth Ubale, Synchronized Communications #43, Velankani Tech Park, Block No. II,

Re: Learning Spark

2015-04-06 Thread Ted Yu
bq. I need to know on what all databases You can access HBase using Spark. Cheers On Mon, Apr 6, 2015 at 5:59 AM, Akhil Das ak...@sigmoidanalytics.com wrote: We had few sessions at Sigmoid, you could go through the meetup page for details:

Spark 1.3.0: Running Pi example on YARN fails

2015-04-06 Thread Zork Sail
I have `Hadoop 2.6.0.2.2.0.0-2041` with `Hive 0.14.0.2.2.0.0-2041 ` After building Spark with command: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests package I try to run Pi example on YARN with the following command: export

What happened to the Row class in 1.3.0?

2015-04-06 Thread ARose
I am trying to call Row.create(object[]) similarly to what's shown in this programming guide https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema , but the create() method is no longer recognized. I tried to look up the documentation for the Row

Cannot build learning spark project

2015-04-06 Thread Adamantios Corais
Hi, I am trying to build this project https://github.com/databricks/learning-spark with mvn package. This should work out of the box but unfortunately it doesn't. In fact, I get the following error: mvn pachage -X Apache Maven 3.0.5 Maven home: /usr/share/maven Java version: 1.7.0_76, vendor:

Re: Cannot build learning spark project

2015-04-06 Thread Sean Owen
(This mailing list concerns Spark itself rather than the book about Spark. Your question is about building code that isn't part of Spark, so, the right place to ask is https://github.com/databricks/learning-spark You have a typo in pachage but I assume that's just your typo in this email.) On

(send this email to subscribe)

2015-04-06 Thread 林晨

Learning Spark

2015-04-06 Thread Abhideep Chakravarty
Hi all, We are here planning to set up a Spark learning session series. I need all of your input to create a TOC for this program, i.e. what all to cover if we need to start from the basics and up to what we should go to cover all the aspects of Spark in detail. Also, I need to know on what all

Using DIMSUM with ids

2015-04-06 Thread James
The example below illustrates how to use the DIMSUM algorithm to calculate the similarity between each two rows and output row pairs with cosine similarity that is not less than a threshold.

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Ted Yu
From scaladoc of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala : * To create a new Row, use [[RowFactory.create()]] in Java or [[Row.apply()]] in Scala. * Cheers On Mon, Apr 6, 2015 at 7:23 AM, ARose ashley.r...@telarix.com wrote: I am trying to call Row.create(object[])

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Ted Yu
I searched code base but didn't find RowFactory class. Pardon me. On Mon, Apr 6, 2015 at 7:39 AM, Ted Yu yuzhih...@gmail.com wrote: From scaladoc of sql/catalyst/src/main/scala/org/apache/spark/sql/Row.scala : * To create a new Row, use [[RowFactory.create()]] in Java or [[Row.apply()]]

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Nan Zhu
The Row class was mistakenly not documented in 1.3.0. You can check the 1.3.1 API doc: http://people.apache.org/~pwendell/spark-1.3.1-rc1-docs/api/scala/index.html#org.apache.spark.sql.Row Best, -- Nan Zhu http://codingcat.me On Monday, April 6, 2015 at 10:23 AM, ARose wrote: I am trying to
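For anyone hitting the same API change, a small sketch of the two creation paths the scaladoc quoted above mentions (the values are arbitrary):

```scala
import org.apache.spark.sql.Row

// Scala: Row.apply
val r = Row("Alice", 30)

// Java equivalent (per the scaladoc):
//   Row r = RowFactory.create("Alice", 30);
```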

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Nan Zhu
Hi, Ted It’s here: https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java Best, -- Nan Zhu http://codingcat.me On Monday, April 6, 2015 at 10:44 AM, Ted Yu wrote: I searched code base but didn't

Spark Average

2015-04-06 Thread barisak
Hi, I have the case class described below. case class weatherCond(dayOfdate: String, minDeg: Int, maxDeg: Int, meanDeg: Int) I am reading the data from a CSV file and I put this data into the weatherCond class with this code: val weathersRDD = sc.textFile("weather.csv").map { line => val

How to work with sparse data in Python?

2015-04-06 Thread SecondDatke
I'm trying to apply Spark to an NLP problem that I'm working on. I have nearly 4 million tweets and I have converted them into word vectors. It's pretty sparse because each message has just dozens of words but the vocabulary has tens of thousands of words. These vectors should be loaded each

DataFrame -- help with encoding factor variables

2015-04-06 Thread Yana Kadiyska
Hi folks, currently have a DF that has a factor variable -- say gender. I am hoping to use the RandomForest algorithm on this data and it appears that this needs to be converted to RDD[LabeledPoint] first -- i.e. all features need to be double-encoded. I see

Re: What happened to the Row class in 1.3.0?

2015-04-06 Thread Ted Yu
Thanks Nan. I was searching for RowFactory.scala Cheers On Mon, Apr 6, 2015 at 7:52 AM, Nan Zhu zhunanmcg...@gmail.com wrote: Hi, Ted It’s here: https://github.com/apache/spark/blob/61b427d4b1c4934bd70ed4da844b64f0e9a377aa/sql/catalyst/src/main/java/org/apache/spark/sql/RowFactory.java

Re: Spark + Kinesis

2015-04-06 Thread Vadim Bichutskiy
Hi all, I am wondering, has anyone on this list been able to successfully implement Spark on top of Kinesis? Best, Vadim On Sun, Apr 5, 2015 at 1:50 PM, Vadim Bichutskiy vadim.bichuts...@gmail.com wrote: Hi all, Below is the output that I am getting. My Kinesis stream has 1 shard, and

Re: Spark Streaming 1.3 Kafka Direct Streams

2015-04-06 Thread Neelesh
Somewhat agree on subclassing and its issues. It looks like the alternative in Spark 1.3.0 is to create a custom build. Is there an enhancement filed for this? If not, I'll file one. Thanks! -neelesh On Wed, Apr 1, 2015 at 12:46 PM, Tathagata Das t...@databricks.com wrote: The challenge of

Re: Spark Average

2015-04-06 Thread Yana Kadiyska
If you're going to do it this way, I would output dayOfdate.substring(0,7), i.e. the month part, and instead of weatherCond, you can use (month,(minDeg,maxDeg,meanDeg)) -- i.e. a PairRDD. So weathersRDD: RDD[(String,(Double,Double,Double))]. Then use a reduceByKey as shown in multiple Spark
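A sketch of the reduceByKey step Yana describes, assuming weathersRDD has already been reshaped into the (month, (minDeg, maxDeg, meanDeg)) pair form:

```scala
// weathersRDD: RDD[(String, (Int, Int, Int))] keyed by month
val monthlyAverages = weathersRDD
  .mapValues { case (minDeg, maxDeg, meanDeg) =>
    (minDeg.toDouble, maxDeg.toDouble, meanDeg.toDouble, 1L)   // running sums plus a count
  }
  .reduceByKey { (a, b) =>
    (a._1 + b._1, a._2 + b._2, a._3 + b._3, a._4 + b._4)
  }
  .mapValues { case (sumMin, sumMax, sumMean, n) =>
    (sumMin / n, sumMax / n, sumMean / n)                      // per-month averages
  }
monthlyAverages.collect().foreach(println)
```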

Re: WordCount example

2015-04-06 Thread Mohit Anchlia
Interesting, I see 0 cores in the UI? - *Cores:* 0 Total, 0 Used On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das t...@databricks.com wrote: What does the Spark Standalone UI at port 8080 say about number of cores? On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia mohitanch...@gmail.com