Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
First step would be to devote some time and > identify which of these 112 factors are actually causative. Some domain > knowledge of the data may be required. Then, you can start off with PCA. > > HTH, > > Regards, > > Sivakumaran S > > On 08-Aug-2016, at 3:01 PM, Tony Lane

Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
Great question Rohit. I am in my early days of ML as well, and it would be great if we could get some ideas on this from other experts in this group. I know we can reduce dimensions by using PCA, but I think that does not let us understand which of the original factors we end up using. -
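
A minimal sketch of PCA in the Spark 2.0 ML API, assuming a hypothetical DataFrame df whose "features" column is a Vector holding the 112 factors discussed above. The loading matrix returned by pc() is one way to trace components back to the original factors, which speaks to the interpretability concern.

    import org.apache.spark.ml.feature.PCA;
    import org.apache.spark.ml.feature.PCAModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Assumes df has a Vector column "features" with the original factors.
    PCAModel pcaModel = new PCA()
        .setInputCol("features")
        .setOutputCol("pcaFeatures")
        .setK(10)                 // k = 10 is an arbitrary choice here
        .fit(df);

    Dataset<Row> reduced = pcaModel.transform(df);

    // Each column of the loading matrix shows how strongly every original
    // factor contributes to one principal component.
    System.out.println(pcaModel.pc());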

Re: Kmeans dataset initialization

2016-08-06 Thread Tony Lane
Can anyone suggest how I can initialize the KMeans structure directly from a Dataset of Row? On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane <tonylane@gmail.com> wrote: > I have all the data required for KMeans in a dataset in memory > > Standard approach to load this data from a file

Kmeans dataset initialization

2016-08-05 Thread Tony Lane
I have all the data required for KMeans in a dataset in memory. The standard approach is to load this data from a file with spark.read().format("libsvm").load(filename), where the file has data in the format 0 1:0.0 2:0.0 3:0.0. How do I do this from a dataset already present in memory? Any suggestions?
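
A sketch of one way to do this in Java, assuming an existing SparkSession named spark: build the Dataset<Row> with a Vector column called "features" directly in memory and fit KMeans on it, skipping the libsvm file entirely.

    import java.util.Arrays;
    import java.util.List;
    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.linalg.VectorUDT;
    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // Each row carries the feature vector that would otherwise come from libsvm.
    List<Row> rows = Arrays.asList(
        RowFactory.create(Vectors.dense(0.0, 0.0, 0.0)),
        RowFactory.create(Vectors.dense(0.1, 0.1, 0.1)),
        RowFactory.create(Vectors.dense(9.0, 9.0, 9.0)));

    StructType schema = new StructType(new StructField[] {
        new StructField("features", new VectorUDT(), false, Metadata.empty())
    });

    Dataset<Row> data = spark.createDataFrame(rows, schema);

    // KMeans reads the "features" column by default.
    KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(data);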

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Mike, I have figured out how to do this. Thanks for the suggestion, it works great. I am trying to figure out the performance impact of this. Thanks again. On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane@gmail.com> wrote: > @mike - this looks great. How can I do this in Java

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
> Mike > > On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane@gmail.com> wrote: > > Ayan - basically I have a dataset with structure, where the bid values are unique > strings > > bid: String > val: integer > > I need unique int values for these string bids

Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote: > Hi > > Can you explain a little further? > > best > Ayan > > On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com> wrote: > >> I have a row with structure like

Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
I have a row with a structure like identifier: String, value: int. All identifiers are unique, and I want to generate a unique long id for the data and get a row object back for further processing. I understand using the zipWithUniqueId function on RDD, but that would mean first converting to RDD and
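
The suggestion Tony adopted is not visible in the truncated replies above; one common approach that matches the stated requirement (no RDD round-trip, no join back) is monotonically_increasing_id, sketched here against a hypothetical DataFrame df.

    import static org.apache.spark.sql.functions.monotonically_increasing_id;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Assumes df has columns identifier:String and value:int.
    // The generated longs are unique per row but not consecutive,
    // because each partition is given its own id range.
    Dataset<Row> withId = df.withColumn("uid", monotonically_increasing_id());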

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
Sean Owen <so...@cloudera.com> wrote: > You mean "new int[] {0,1,2}" because vectors are 0-indexed. > > On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane <tonylane@gmail.com> wrote: > > Hi Sean, > > > > I did not understand, > > I created a KMean

Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
wrote: > You declare that the vector has 3 dimensions, but then refer to its > 4th dimension (at index 3). That is the error. > > On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com> wrote: > > I am using the following vector definition in Java

Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I am using the following vector definition in Java: Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }). However, when I run the predict method on this vector, it leads to: Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3 at
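
As Sean Owen points out in the replies above, the indices array is 0-based, so a size-3 vector may only use positions 0, 1 and 2; index 3 is what triggered the exception. The corrected definition (shown with the spark.ml linalg classes; the same applies to spark.mllib):

    import org.apache.spark.ml.linalg.Vector;
    import org.apache.spark.ml.linalg.Vectors;

    // size = 3, so valid indices are 0..2.
    Vector v = Vectors.sparse(3, new int[] { 0, 1, 2 },
                              new double[] { 1.1, 1.1, 1.1 });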

Re: Stop Spark Streaming Jobs

2016-08-03 Thread Tony Lane
SparkSession exposes a stop() method. On Wed, Aug 3, 2016 at 8:53 AM, Pradeep wrote: > Thanks Park. I am doing the same. Was trying to understand if there are > other ways. > > Thanks, > Pradeep > > > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee >
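
A brief illustration of the two stop paths this thread touches on, assuming a hypothetical SparkSession spark and JavaStreamingContext jssc already exist:

    // SparkSession.stop() shuts down the underlying SparkContext.
    spark.stop();

    // A streaming context can also be stopped on its own; the flags control
    // whether the SparkContext stops too and whether in-flight batches finish.
    jssc.stop(true /* stopSparkContext */, true /* stopGracefully */);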

Error in building spark core on windows - any suggestions please

2016-08-03 Thread Tony Lane
I am trying to build Spark on Windows and am getting the following test failures and consequent build failures. [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @ spark-core_2.11 --- --- T E S T S

error while running filter on dataframe

2016-07-31 Thread Tony Lane
Can someone help me understand this error, which occurs while running a filter on a DataFrame: 2016-07-31 21:01:57 ERROR CodeGenerator:91 - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 117, Column 58: Expression "mapelements_isNull" is not an rvalue

spark java - convert string to date

2016-07-31 Thread Tony Lane
Is there any built-in function in Java with Spark to convert a string to a date more efficiently, or do we just use the standard Java techniques? -Tony
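
A sketch of the built-in route, assuming a hypothetical DataFrame df with a string column "ds": to_date handles yyyy-MM-dd strings directly, and unix_timestamp with a pattern covers other formats (a to_date(column, format) overload only arrived in later Spark versions).

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.to_date;
    import static org.apache.spark.sql.functions.unix_timestamp;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    Dataset<Row> parsed = df
        // Parses ISO-style strings such as "2016-07-31".
        .withColumn("d1", to_date(col("ds")))
        // For custom patterns: parse to epoch seconds, then cast down to date.
        .withColumn("d2", unix_timestamp(col("ds"), "yyyy-MM-dd")
            .cast("timestamp").cast("date"));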

Visualization of data analysed using spark

2016-07-30 Thread Tony Lane
I am developing my analysis application using Spark (with Eclipse as the IDE). What is a good way to visualize the data, taking into consideration that I have multiple files which make up my Spark application? I have seen some notebook demos but am not sure how to use my application with such

Re: how to order data in descending order in spark dataset

2016-07-30 Thread Tony Lane
Just to clarify, I am trying to do this in Java: ts.groupBy("b").count().orderBy("count"); On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane <tonylane@gmail.com> wrote: > ts.groupBy("b").count().orderBy("count"); > > how can I order this data in descending order of count? > Any suggestions? > > -Tony >
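
In Java this is a matter of passing a descending Column instead of a bare column name; the two forms below are equivalent:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.desc;

    // orderBy(String...) sorts ascending; pass a Column to control direction.
    ts.groupBy("b").count().orderBy(col("count").desc());
    ts.groupBy("b").count().orderBy(desc("count"));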

how to order data in descending order in spark dataset

2016-07-30 Thread Tony Lane
ts.groupBy("b").count().orderBy("count"); how can I order this data in descending order of count Any suggestions -Tony

Spark 2.0 blocker on windows - spark-warehouse path issue

2016-07-30 Thread Tony Lane
Caused by: java.net.URISyntaxException: Relative path in absolute URI: file:C:/ibm/spark-warehouse Does anybody know a solution to this? Cheers, Tony
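
A sketch of the workaround commonly reported for this Spark 2.0-on-Windows problem: set spark.sql.warehouse.dir to a well-formed file URI yourself (the path below is only an example), so Spark never constructs the malformed file:C:/... form on its own.

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("MyApp")
        .master("local[*]")
        // Three slashes make the Windows path an absolute file URI.
        .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
        .getOrCreate();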

Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-29 Thread Tony Lane
I am facing the same issue and am completely blocked here. *Sean, can you please help with this issue?* Migrating to 2.0.0 has really stalled our development effort. -Tony > -- Forwarded message -- > From: Sean Owen > Date: Fri, Jul 29, 2016 at 12:47 AM >