Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
There must be an algorithmic way to figure out which of these factors
contribute the least and remove them from the analysis.
I am hoping someone can throw some insight on this.
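
One possible starting point (just a sketch, not a definitive answer): fit a
PCA model in Spark ML and inspect its loading matrix, which shows how strongly
each original factor contributes to the retained components. The variable
"data" and the "features" column name below are assumptions for illustration.

import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.PCAModel;
import org.apache.spark.ml.linalg.Matrix;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumed: data is a Dataset<Row> with a 112-dimensional vector column "features".
PCAModel pca = new PCA()
    .setInputCol("features")
    .setOutputCol("pcaFeatures")
    .setK(20)
    .fit(data);

// pc() is the loading matrix: rows = original factors, columns = retained components.
Matrix loadings = pca.pc();
for (int factor = 0; factor < loadings.numRows(); factor++) {
    double contribution = 0.0;
    for (int comp = 0; comp < loadings.numCols(); comp++) {
        double w = loadings.apply(factor, comp);
        contribution += w * w;  // aggregate squared loading over the 20 components
    }
    System.out.println("factor " + factor + " -> " + contribution);
}

Factors whose aggregate loading is close to zero contribute little to the
retained components and are candidates for dropping, though that is only a
rough heuristic, not a substitute for domain knowledge.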

On Mon, Aug 8, 2016 at 7:41 PM, Sivakumaran S <siva.kuma...@me.com> wrote:

> Not an expert here, but the first step would be to devote some time to
> identifying which of these 112 factors are actually causative. Some domain
> knowledge of the data may be required. Then, you can start off with PCA.
>
> HTH,
>
> Regards,
>
> Sivakumaran S
>
> On 08-Aug-2016, at 3:01 PM, Tony Lane <tonylane@gmail.com> wrote:
>
> Great question Rohit.  I am in my early days of ML as well, and it would be
> great if we get some ideas on this from the other experts in this group.
>
> I know we can reduce dimensions by using PCA, but I think that does not
> let us understand which of the original factors we are using in the
> end.
>
> - Tony L.
>
> On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha <rohitchaddha1...@gmail.com>
> wrote:
>
>>
>> I have a data-set where each data-point has 112 factors.
>>
>> I want to remove the factors which are not relevant, and say reduce to 20
>> factors out of these 112 and then do clustering of data-points using these
>> 20 factors.
>>
>> How do I do this, and how do I figure out which of the 20 factors are
>> useful for the analysis?
>>
>> I see SVD and PCA implementations, but I am not sure whether these tell you
>> which elements are removed and which remain.
>>
>> Can someone please help me understand what to do here?
>>
>> thanks,
>> -Rohit
>>
>>
>
>


Re: Machine learning question (using Spark) - removing redundant factors while doing clustering

2016-08-08 Thread Tony Lane
Great question Rohit.  I am in my early days of ML as well, and it would be
great if we get some ideas on this from the other experts in this group.

I know we can reduce dimensions by using PCA, but I think that does not
let us understand which of the original factors we are using in the
end.

- Tony L.

On Mon, Aug 8, 2016 at 5:12 PM, Rohit Chaddha 
wrote:

>
> I have a data-set where each data-point has 112 factors.
>
> I want to remove the factors which are not relevant, and say reduce to 20
> factors out of these 112 and then do clustering of data-points using these
> 20 factors.
>
> How do I do this, and how do I figure out which of the 20 factors are
> useful for the analysis?
>
> I see SVD and PCA implementations, but I am not sure whether these tell you
> which elements are removed and which remain.
>
> Can someone please help me understand what to do here?
>
> thanks,
> -Rohit
>
>


Re: Kmeans dataset initialization

2016-08-06 Thread Tony Lane
Can anyone suggest how I can initialize the KMeans structure directly from a
Dataset of Row?
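
One way that seems to work (a sketch, not a definitive answer): skip the
libsvm file entirely, assemble the numeric columns of the in-memory
Dataset<Row> into a single vector column, and fit the ml KMeans estimator on
that. The column names "f1", "f2", "f3" and k = 3 are assumptions for
illustration.

import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumed: rows is the in-memory Dataset<Row> with numeric columns f1, f2, f3.
VectorAssembler assembler = new VectorAssembler()
    .setInputCols(new String[] {"f1", "f2", "f3"})
    .setOutputCol("features");
Dataset<Row> assembled = assembler.transform(rows);

// Fit KMeans directly on the assembled vector column.
KMeansModel model = new KMeans()
    .setK(3)
    .setFeaturesCol("features")
    .fit(assembled);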

On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane <tonylane@gmail.com> wrote:

> I have all the data required for KMeans in a dataset in memory
>
> Standard approach to load this data from a file is
> spark.read().format("libsvm").load(filename)
>
> where the file has data in the format
> 0 1:0.0 2:0.0 3:0.0
>
>
> How do I do this from an in-memory dataset that is already present?
> Any suggestions?
>
> -Tony
>
>


Kmeans dataset initialization

2016-08-05 Thread Tony Lane
I have all the data required for KMeans in a dataset in memory

Standard approach to load this data from a file is
spark.read().format("libsvm").load(filename)

where the file has data in the format
0 1:0.0 2:0.0 3:0.0


How do I do this from an in-memory dataset that is already present?
Any suggestions?

-Tony


Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Mike.

I have figured out how to do this. Thanks for the suggestion; it works
great. I am trying to figure out the performance impact of this.

thanks again
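
For reference, a minimal Java version of Mike's suggestion (assuming "ds" is
the existing Dataset<Row>):

import static org.apache.spark.sql.functions.monotonically_increasing_id;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Adds a unique (but not consecutive) 64-bit id per row without converting to an RDD.
Dataset<Row> withId = ds.withColumn("id", monotonically_increasing_id());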


On Fri, Aug 5, 2016 at 9:25 PM, Tony Lane <tonylane@gmail.com> wrote:

> @mike - this looks great. How can I do this in Java? What is the
> performance implication on a large dataset?
>
> @sonal - I can't have a collision in the values.
>
> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com>
> wrote:
>
>> You can use the monotonically_increasing_id method to generate guaranteed
>> unique (but not necessarily consecutive) IDs.  Calling something like:
>>
>> df.withColumn("id", monotonically_increasing_id())
>>
>> You don't mention which language you're using but you'll need to pull in
>> the sql.functions library.
>>
>> Mike
>>
>> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane@gmail.com> wrote:
>>
>> Ayan - basically I have a dataset with the following structure, where the
>> bid values are unique strings:
>>
>> bid: String
>> val: integer
>>
>> I need unique int values for these string bids to do some processing in
>> the dataset, like this:
>>
>> id: int   (unique integer id for each bid)
>> bid: String
>> val: integer
>>
>>
>>
>> -Tony
>>
>> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Can you explain a little further?
>>>
>>> best
>>> Ayan
>>>
>>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com>
>>> wrote:
>>>
>>>> I have a row with structure like
>>>>
>>>> identifier: String
>>>> value: int
>>>>
>>>> All identifiers are unique, and I want to generate a unique long id for
>>>> the data and get a Row object back for further processing.
>>>>
>>>> I understand using the zipWithUniqueId function on an RDD, but that would
>>>> mean first converting to an RDD and then joining the RDD back to the dataset.
>>>>
>>>> What is the best way to do this?
>>>>
>>>> -Tony
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>


Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
@mike - this looks great. How can I do this in Java? What is the
performance implication on a large dataset?

@sonal - I can't have a collision in the values.

On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger <m...@flexiblecreations.com>
wrote:

> You can use the monotonically_increasing_id method to generate guaranteed
> unique (but not necessarily consecutive) IDs.  Calling something like:
>
> df.withColumn("id", monotonically_increasing_id())
>
> You don't mention which language you're using but you'll need to pull in
> the sql.functions library.
>
> Mike
>
> On Aug 5, 2016, at 9:11 AM, Tony Lane <tonylane@gmail.com> wrote:
>
> Ayan - basically I have a dataset with the following structure, where the
> bid values are unique strings:
>
> bid: String
> val: integer
>
> I need unique int values for these string bids to do some processing in
> the dataset, like this:
>
> id: int   (unique integer id for each bid)
> bid: String
> val: integer
>
>
>
> -Tony
>
> On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> Can you explain a little further?
>>
>> best
>> Ayan
>>
>> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com>
>> wrote:
>>
>>> I have a row with structure like
>>>
>>> identifier: String
>>> value: int
>>>
>>> All identifiers are unique, and I want to generate a unique long id for
>>> the data and get a Row object back for further processing.
>>>
>>> I understand using the zipWithUniqueId function on an RDD, but that would
>>> mean first converting to an RDD and then joining the RDD back to the dataset.
>>>
>>> What is the best way to do this?
>>>
>>> -Tony
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


Re: Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
Ayan - basically I have a dataset with the following structure, where the
bid values are unique strings:

bid: String
val: integer

I need unique int values for these string bids to do some processing in
the dataset, like this:

id: int   (unique integer id for each bid)
bid: String
val: integer



-Tony

On Fri, Aug 5, 2016 at 6:35 PM, ayan guha <guha.a...@gmail.com> wrote:

> Hi
>
> Can you explain a little further?
>
> best
> Ayan
>
> On Fri, Aug 5, 2016 at 10:14 PM, Tony Lane <tonylane@gmail.com> wrote:
>
>> I have a row with structure like
>>
>> identifier: String
>> value: int
>>
>> All identifiers are unique, and I want to generate a unique long id for the
>> data and get a Row object back for further processing.
>>
>> I understand using the zipWithUniqueId function on an RDD, but that would
>> mean first converting to an RDD and then joining the RDD back to the dataset.
>>
>> What is the best way to do this?
>>
>> -Tony
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Generating unique id for a column in Row without breaking into RDD and joining back

2016-08-05 Thread Tony Lane
I have a row with structure like

identifier: String
value: int

All identifiers are unique, and I want to generate a unique long id for the
data and get a Row object back for further processing.

I understand using the zipWithUniqueId function on an RDD, but that would mean
first converting to an RDD and then joining the RDD back to the dataset.

What is the best way to do this?

-Tony


Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I guess the setup of the model and the usage of the vector confused me.
The setup takes positions 1, 2, 3 - like this in the build example - "1:0.0
2:0.0 3:0.0".
I thought I needed to follow the same numbering while creating the vector too.

thanks a bunch
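
For anyone hitting the same exception: the libsvm text format is 1-based, but
Vectors.sparse indices are 0-based, so a 3-dimensional vector looks like this
(a minimal sketch, assuming the org.apache.spark.ml.linalg classes):

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;

// Valid indices for a size-3 vector are 0, 1 and 2; using index 3 triggers the
// ArrayIndexOutOfBoundsException shown below.
Vector v = Vectors.sparse(3, new int[] {0, 1, 2}, new double[] {1.1, 1.1, 1.1});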


On Thu, Aug 4, 2016 at 12:39 AM, Sean Owen <so...@cloudera.com> wrote:

> You mean "new int[] {0,1,2}" because vectors are 0-indexed.
>
> On Wed, Aug 3, 2016 at 11:52 AM, Tony Lane <tonylane@gmail.com> wrote:
> > Hi Sean,
> >
> > I did not understand.
> > I created a KMeansModel with 3 dimensions, and then I am calling the predict
> > method on this model with a 3-dimensional vector.
> > I am not sure what is wrong with this approach. Am I missing a point?
> >
> > Tony
> >
> > On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote:
> >>
> >> You declare that the vector has 3 dimensions, but then refer to its
> >> 4th dimension (at index 3). That is the error.
> >>
> >> On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com>
> wrote:
> >> > I am using the following vector definition in java
> >> >
> >> > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1
> >> > }))
> >> >
> >> > However when I run the predict method on this vector it leads to
> >> >
> >> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
> >> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
> >> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
> >> > at
> >> >
> >> >
> org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
> >> > at
> >> >
> >> >
> org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
> >> > at
> >> >
> >> >
> org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
> >> > at
> >> >
> >> >
> org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
> >> > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
> >> > at
> >> >
> org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
> >> > at
> >> >
> >> >
> org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
> >> > at
> org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
> >
> >
>


Re: Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
Hi Sean,

I did not understand.
I created a KMeansModel with 3 dimensions, and then I am calling the predict
method on this model with a 3-dimensional vector.
I am not sure what is wrong with this approach. Am I missing a point?

Tony

On Wed, Aug 3, 2016 at 11:22 PM, Sean Owen <so...@cloudera.com> wrote:

> You declare that the vector has 3 dimensions, but then refer to its
> 4th dimension (at index 3). That is the error.
>
> On Wed, Aug 3, 2016 at 10:43 AM, Tony Lane <tonylane@gmail.com> wrote:
> > I am using the following vector definition in java
> >
> > Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }))
> >
> > However when I run the predict method on this vector it leads to
> >
> > Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
> > at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
> > at
> >
> org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
> > at
> >
> org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
> > at
> >
> org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
> > at
> >
> org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
> > at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
> > at
> org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
> > at
> >
> org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
> > at org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)
>


Using sparse vector leads to array out of bounds exception

2016-08-03 Thread Tony Lane
I am using the following vector definition in java

Vectors.sparse(3, new int[] { 1, 2, 3 }, new double[] { 1.1, 1.1, 1.1 }))

However when I run the predict method on this vector it leads to

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 3
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:143)
at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:115)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:298)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:606)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:580)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:574)
at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:74)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:574)
at org.apache.spark.mllib.clustering.KMeansModel.predict(KMeansModel.scala:59)
at org.apache.spark.ml.clustering.KMeansModel.predict(KMeans.scala:130)


Re: Stop Spark Streaming Jobs

2016-08-03 Thread Tony Lane
SparkSession exposes a stop() method.
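
For a streaming job specifically, a minimal sketch of stopping from inside the
application (assuming "jssc" is the application's JavaStreamingContext; the
YARN kill steps quoted below still apply when stopping from outside):

import org.apache.spark.streaming.api.java.JavaStreamingContext;

// Stop gracefully: let in-flight batches finish, then shut down the
// underlying SparkContext as well.
jssc.stop(true, true);  // stopSparkContext = true, stopGracefully = true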

On Wed, Aug 3, 2016 at 8:53 AM, Pradeep  wrote:

> Thanks Park. I am doing the same. Was trying to understand if there are
> other ways.
>
> Thanks,
> Pradeep
>
> > On Aug 2, 2016, at 10:25 PM, Park Kyeong Hee 
> wrote:
> >
> > So sorry. Your name was Pradeep !!
> >
> > -Original Message-
> > From: Park Kyeong Hee [mailto:kh1979.p...@samsung.com]
> > Sent: Wednesday, August 03, 2016 11:24 AM
> > To: 'Pradeep'; 'user@spark.apache.org'
> > Subject: RE: Stop Spark Streaming Jobs
> >
> > Hi. Paradeep
> >
> >
> > Did you mean, how to kill the job?
> > If yes, you should kill the driver and follow next.
> >
> > on yarn-client
> > 1. find pid - "ps -ef | grep <app name>"
> > 2. kill it - "kill -9 <pid>"
> > 3. check executors were down - "yarn application -list"
> >
> > on yarn-cluster
> > 1. find driver's application ID - "yarn application -list"
> > 2. stop it - "yarn application -kill <application ID>"
> > 3. check driver and executors were down - "yarn application -list"
> >
> >
> > Thanks.
> >
> > -Original Message-
> > From: Pradeep [mailto:pradeep.mi...@mail.com]
> > Sent: Wednesday, August 03, 2016 10:48 AM
> > To: user@spark.apache.org
> > Subject: Stop Spark Streaming Jobs
> >
> > Hi All,
> >
> > My streaming job reads data from Kafka. The job is triggered and pushed
> to
> > background with nohup.
> >
> > What are the recommended ways to stop job either on yarn-client or
> cluster
> > mode.
> >
> > Thanks,
> > Pradeep
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
> >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Error in building spark core on windows - any suggestions please

2016-08-03 Thread Tony Lane
I am trying to build Spark on Windows and am getting the following test
failures and, consequently, build failures.

[INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @
spark-core_2.11 ---

---
 T E S T S
---
Running org.apache.spark.api.java.OptionalSuite
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.054 sec -
in org.apache.spark.api.java.OptionalSuite
Running org.apache.spark.JavaAPISuite
Tests run: 90, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 25.792 sec
<<< FAILURE! - in org.apache.spark.JavaAPISuite
wholeTextFiles(org.apache.spark.JavaAPISuite)  Time elapsed: 0.382 sec  <<<
FAILURE!
java.lang.AssertionError:
expected: but was:
at
org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)

Running org.apache.spark.JavaJdbcRDDSuite
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.43 sec -
in org.apache.spark.JavaJdbcRDDSuite
Running org.apache.spark.launcher.SparkLauncherSuite
Tests run: 8, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.047 sec
<<< FAILURE! - in org.apache.spark.launcher.SparkLauncherSuite
testChildProcLauncher(org.apache.spark.launcher.SparkLauncherSuite)  Time
elapsed: 0.032 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
at
org.apache.spark.launcher.SparkLauncherSuite.testChildProcLauncher(SparkLauncherSuite.java:169)

Running org.apache.spark.memory.TaskMemoryManagerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec -
in org.apache.spark.memory.TaskMemoryManagerSuite
Running org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.009 sec -
in org.apache.spark.shuffle.sort.PackedRecordPointerSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.132 sec -
in org.apache.spark.shuffle.sort.ShuffleInMemoryRadixSorterSuite
Running org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.162 sec -
in org.apache.spark.shuffle.sort.ShuffleInMemorySorterSuite
Running org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Tests run: 20, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.597 sec
- in org.apache.spark.shuffle.sort.UnsafeShuffleWriterSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.117 sec
- in org.apache.spark.unsafe.map.BytesToBytesMapOffHeapSuite
Running org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Tests run: 13, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.697 sec
- in org.apache.spark.unsafe.map.BytesToBytesMapOnHeapSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.853 sec
- in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterRadixSortSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.624 sec
- in org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorterSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.003 sec -
in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterRadixSortSuite
Running
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 sec -
in org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorterSuite
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
MaxPermSize=512m; support was removed in 8.0

Results :

Failed tests:
  JavaAPISuite.wholeTextFiles:1089 expected: but was:
  SparkLauncherSuite.testChildProcLauncher:169 expected:<0> but was:<1>

Tests run: 195, Failures: 2, Errors: 0, Skipped: 0

[INFO]

[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ... SUCCESS [ 11.038 s]
[INFO] Spark Project Tags . SUCCESS [ 11.611 s]
[INFO] Spark Project Sketch ... SUCCESS [ 27.037 s]
[INFO] Spark Project Networking ... SUCCESS [ 54.003 s]
[INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 17.955 s]
[INFO] Spark Project Unsafe ... SUCCESS [ 21.667 s]
[INFO] Spark Project Launcher . SUCCESS [ 17.632 s]
[INFO] Spark Project Core . FAILURE [04:56 min]
[INFO] Spark Project GraphX ... SKIPPED
[INFO] Spark Project Streaming  

error while running filter on dataframe

2016-07-31 Thread Tony Lane
Can someone help me understand this error, which occurs while running a
filter on a DataFrame?

2016-07-31 21:01:57 ERROR CodeGenerator:91 - failed to compile:
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line
117, Column 58: Expression "mapelements_isNull" is not an rvalue
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ /** Codegened pipeline for:
/* 006 */ * TungstenAggregate(key=[],
functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#127L])
/* 007 */ +- Project
/* 008 */ +- Filter (is...
/* 009 */   */
/* 010 */ final class GeneratedIterator extends
org.apache.spark.sql.execution.BufferedRowIterator {
/* 011 */   private Object[] references;
/* 012 */   private boolean agg_initAgg;
/* 013 */   private boolean agg_bufIsNull;
/* 014 */   private long agg_bufValue;
/* 015 */   private scala.collection.Iterator inputadapter_input;
/* 016 */   private Object[] deserializetoobject_values;
/* 017 */   private org.apache.spark.sql.types.StructType
deserializetoobject_schema;
/* 018 */   private UnsafeRow deserializetoobject_result;
/* 019 */   private
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
deserializetoobject_holder;
/* 020 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
deserializetoobject_rowWriter;
/* 021 */   private UnsafeRow mapelements_result;
/* 022 */   private
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
mapelements_holder;
/* 023 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
mapelements_rowWriter;
/* 024 */   private Object[] serializefromobject_values;
/* 025 */   private UnsafeRow serializefromobject_result;
/* 026 */   private
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
serializefromobject_holder;
/* 027 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
serializefromobject_rowWriter;
/* 028 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
serializefromobject_rowWriter1;
/* 029 */   private org.apache.spark.sql.execution.metric.SQLMetric
filter_numOutputRows;
/* 030 */   private UnsafeRow filter_result;
/* 031 */   private
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder
filter_holder;
/* 032 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
filter_rowWriter;
/* 033 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
filter_rowWriter1;
/* 034 */   private org.apache.spark.sql.execution.metric.SQLMetric
agg_numOutputRows;
/* 035 */   private org.apache.spark.sql.execution.metric.SQLMetric
agg_aggTime;
/* 036 */   private UnsafeRow agg_result;
/* 037 */   private
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
/* 038 */   private
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
agg_rowWriter;
/* 039 */
/* 040 */   public GeneratedIterator(Object[] references) {
/* 041 */ this.references = references;
/* 042 */   }
/* 043 */


spark java - convert string to date

2016-07-31 Thread Tony Lane
Is there any built-in function in Spark's Java API to convert a string to a
date more efficiently, or do we just use the standard Java techniques?

-Tony
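
For what it is worth, a sketch using the built-in SQL functions instead of
plain Java (the column name and date format are assumptions for illustration):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.to_date;
import static org.apache.spark.sql.functions.unix_timestamp;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Assumed: df has a string column "event_ts" formatted as "yyyy-MM-dd HH:mm:ss".
Dataset<Row> withDate = df.withColumn("event_date",
    to_date(unix_timestamp(col("event_ts"), "yyyy-MM-dd HH:mm:ss").cast("timestamp")));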


Visualization of data analysed using spark

2016-07-30 Thread Tony Lane
I am developing my analysis application using Spark (with Eclipse as the
IDE).

What is a good way to visualize the data, taking into consideration that my
Spark application is made up of multiple files?

I have seen some notebook demos, but I am not sure how to use my application
with such notebooks.

Thoughts / suggestions / experiences -- please share.

-Tony


Re: how to order data in descending order in spark dataset

2016-07-30 Thread Tony Lane
Just to clarify, I am trying to do this in Java:

ts.groupBy("b").count().orderBy("count");
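
In Java the sort order can be flipped with the desc() helper from
sql.functions (a sketch, assuming ts is a Dataset<Row>):

import static org.apache.spark.sql.functions.desc;

// Sort by the aggregated count column in descending order.
ts.groupBy("b").count().orderBy(desc("count"));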



On Sun, Jul 31, 2016 at 12:00 AM, Tony Lane <tonylane@gmail.com> wrote:

> ts.groupBy("b").count().orderBy("count");
>
> How can I order this data in descending order of count?
> Any suggestions?
>
> -Tony
>


how to order data in descending order in spark dataset

2016-07-30 Thread Tony Lane
ts.groupBy("b").count().orderBy("count");

How can I order this data in descending order of count?
Any suggestions?

-Tony


Spark 2.0 blocker on windows - spark-warehouse path issue

2016-07-30 Thread Tony Lane
Caused by: java.net.URISyntaxException: Relative path in absolute URI:
file:C:/ibm/spark-warehouse

Does anybody know a solution to this?

cheers
tony
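
One workaround that has been suggested under SPARK-15899 (a sketch only; the
directory used here is an arbitrary assumption) is to set
spark.sql.warehouse.dir to an explicit file: URI when building the session:

import org.apache.spark.sql.SparkSession;

// Point the warehouse directory at a well-formed file: URI so Spark never sees
// "file:C:/..." as a relative path inside an absolute URI.
SparkSession spark = SparkSession.builder()
    .appName("warehouse-workaround")  // app name is an assumption
    .master("local[*]")
    .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
    .getOrCreate();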


Re: Spark 2.0 -- spark warehouse relative path in absolute URI error

2016-07-29 Thread Tony Lane
I am facing the same issue and am completely blocked here.
*Sean, can you please help with this issue?*

Migrating to 2.0.0 has really stalled our development effort.

-Tony



> -- Forwarded message --
> From: Sean Owen 
> Date: Fri, Jul 29, 2016 at 12:47 AM
> Subject: Re: Spark 2.0 -- spark warehouse relative path in absolute URI
> error
> To: Rohit Chaddha 
> Cc: "user @spark" 
>
>
> Ah, right. This wasn't actually resolved. Yeah your input on 15899
> would be welcome. See if the proposed fix helps.
>
> On Thu, Jul 28, 2016 at 11:52 AM, Rohit Chaddha
>  wrote:
> > Sean,
> >
> > I saw some JIRA tickets, and it looks like this is still an open bug (rather
> > than an improvement, as marked in JIRA).
> >
> > https://issues.apache.org/jira/browse/SPARK-15893
> > https://issues.apache.org/jira/browse/SPARK-15899
> >
> > I am experimenting, but do you know of any solution off the top of your head?
> >
> >
> >
> > On Fri, Jul 29, 2016 at 12:06 AM, Rohit Chaddha <
> rohitchaddha1...@gmail.com>
> > wrote:
> >>
> >> I am simply trying to do
> >> session.read().json("file:///C:/data/a.json");
> >>
> >> in 2.0.0-preview it was working fine with
> >> sqlContext.read().json("C:/data/a.json");
> >>
> >>
> >> -Rohit
> >>
> >> On Fri, Jul 29, 2016 at 12:03 AM, Sean Owen  wrote:
> >>>
> >>> Hm, file:///C:/... doesn't work? that should certainly be an absolute
> >>> URI with an absolute path. What exactly is your input value for this
> >>> property?
> >>>
> >>> On Thu, Jul 28, 2016 at 11:28 AM, Rohit Chaddha
> >>>  wrote:
> >>> > Hello Sean,
> >>> >
> >>> > I have tried both  file:/  and file:///
>>> > But it does not work and gives the same error.
> >>> >
> >>> > -Rohit
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Jul 28, 2016 at 11:51 PM, Sean Owen 
> wrote:
> >>> >>
> >>> >> IIRC that was fixed, in that this is actually an invalid URI. Use
> >>> >> file:/C:/... I think.
> >>> >>
> >>> >> On Thu, Jul 28, 2016 at 10:47 AM, Rohit Chaddha
> >>> >>  wrote:
> >>> >> > I upgraded from 2.0.0-preview to 2.0.0
> >>> >> > and I started getting the following error
> >>> >> >
> >>> >> > Caused by: java.net.URISyntaxException: Relative path in absolute
> >>> >> > URI:
> >>> >> > file:C:/ibm/spark-warehouse
> >>> >> >
> >>> >> > Any ideas how to fix this
> >>> >> >
> >>> >> > -Rohit
> >>> >
> >>> >
> >>
> >>
> >
>
>