Re: Tensor Flow

2016-12-12 Thread tog
TensorFrames is a project from Databricks (
https://github.com/databricks/tensorframes). There have been no commits for a
couple of months, though.

Does anyone have insight into the status of the project?

On Mon, 12 Dec 2016 at 19:31 Meeraj Kunnumpurath 
wrote:

> Apologies. Okay, I will have a look at TensorFrames.
>
> On Mon, Dec 12, 2016 at 10:29 PM, Meeraj Kunnumpurath <
> mee...@servicesymphony.com> wrote:
>
> https://www.tensorflow.org/
>
> Thank you
>
> On Mon, Dec 12, 2016 at 10:27 PM, Guillaume Alleon <
> guillaume.all...@gmail.com> wrote:
>
> TensorFrames?
>
> Sent from a small device
>
> On 12 Dec 2016, at 19:23, Meeraj Kunnumpurath 
> wrote:
>
> Hello,
>
> Is there anything available in Spark similar to TensorFlow? I am looking
> for a mechanism for performing nearest-neighbour search on vectorized image
> data.
>
> Regards
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
> *00 971 50 409 0169*
> *mee...@servicesymphony.com*
>
>
>
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
> *00 971 50 409 0169*
> *mee...@servicesymphony.com*
>
>
>
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
> *00 971 50 409 0169*
> *mee...@servicesymphony.com*
>
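
As a concrete illustration of the nearest-neighbour use case above, here is a
minimal Scala sketch using the locality-sensitive hashing added to Spark ML in
2.1 (BucketedRandomProjectionLSH). The DataFrame `images`, its "features"
column, and the LSH parameters are assumptions for the sketch, not something
from the thread:

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

// Approximate nearest-neighbour search over image feature vectors.
// `images` is assumed to be a DataFrame with a column "features" of ML vectors.
def nearest(images: DataFrame, query: Vector, k: Int): DataFrame = {
  val brp = new BucketedRandomProjectionLSH()
    .setInputCol("features")
    .setOutputCol("hashes")
    .setBucketLength(2.0)   // tuning parameter, chosen arbitrarily here
    .setNumHashTables(3)    // tuning parameter, chosen arbitrarily here
  val model = brp.fit(images)
  model.approxNearestNeighbors(images, query, k).toDF()
}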


Apache Groovy and Spark

2015-11-18 Thread tog
Hi

I started playing with both Apache projects and quickly ran into the
exception below. Can anyone give me a hint about the problem so that I can
dig further?
It seems to be a problem with Spark loading some of the Groovy classes ...

Any idea?
Thanks
Guillaume


tog GroovySpark $ $GROOVY_HOME/bin/groovy
GroovySparkThroughGroovyShell.groovy

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage
0.0 (TID 1, localhost): java.lang.ClassNotFoundException:
Script1$_run_closure1

at java.net.URLClassLoader.findClass(URLClassLoader.java:381)

at java.lang.ClassLoader.loadClass(ClassLoader.java:424)

at java.lang.ClassLoader.loadClass(ClassLoader.java:357)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:348)

at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)

at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)

at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)

at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)

at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)

at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)

at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)

at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:72)

at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:98)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)

at scala.Option.foreach(Option.scala:236)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48

Re: Apache Groovy and Spark

2015-11-18 Thread tog
Hi Steve

Since you are familiar with Groovy, I will go a bit deeper into the details.
My (simple) Groovy scripts work fine with Apache Spark - a closure (when
dehydrated) serializes nicely.
My issue comes when I want to use GroovyShell to run my scripts (my
ultimate goal is to integrate with Apache Zeppelin, so I would need
GroovyShell to run the scripts), and this is where I get the previous
exception.

Sure, you may question the use of Groovy when Scala/Python are nicely
supported. For me it is more a way to support the wider Java community ...
and after all, Scala, Groovy and Java all run on the JVM! Beyond that
point, I would be interested to know from the Spark community whether there
is any plan to integrate more closely with Java, especially with a Java REPL
landing in Java 9.

Cheers
Guillaume

On 18 November 2015 at 21:51, Steve Loughran <ste...@hortonworks.com> wrote:

>
> Looks like Groovy scripts don't serialize over the wire properly.
>
> Back in 2011 I hooked up Groovy to MapReduce, so that you could write
> mappers and reducers there; "grumpy"
>
> https://github.com/steveloughran/grumpy
>
> slides: http://www.slideshare.net/steve_l/hadoop-gets-groovy
>
> What I ended up doing (see slide 13) was send the raw script around as
> text and compile it into a Script instance at the far end. Compilation
> took some time, but the real barrier is that Groovy is not at all fast.
>
> It used to be 10x slower; maybe now, with static compilation and the Java 7
> invoke-dynamic JARs, things are better. I'm still unsure I'd use it in
> production, and, given Spark's focus on Scala and Python, I'd pick one of
> those two.
>
>
> On 18 Nov 2015, at 20:35, tog <guillaume.all...@gmail.com> wrote:
>
> Hi
>
> I started playing with both Apache projects and quickly ran into the
> exception below. Can anyone give me a hint about the problem so that I can
> dig further?
> It seems to be a problem with Spark loading some of the Groovy classes ...
>
> Any idea?
> Thanks
> Guillaume
>
>
> tog GroovySpark $ $GROOVY_HOME/bin/groovy
> GroovySparkThroughGroovyShell.groovy
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> 0.0 (TID 1, localhost): java.lang.ClassNotFoundException:
> Script1$_run_closure1
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
>
> at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
>
> at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
>
> at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
>
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>
> at java.io.ObjectInputStream
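
As a rough illustration of Steve's suggestion quoted above (ship the script
text and compile it at the far end), here is a minimal Scala sketch. The
binding variable name "line" and the per-record run() call are assumptions
for the sketch, not how Zeppelin or any existing integration actually works:

import groovy.lang.{Binding, GroovyShell}
import org.apache.spark.rdd.RDD

// The script source is a plain String, so it serializes without trouble;
// each executor compiles it once per partition and runs it per record.
def runGroovyPerRecord(rdd: RDD[String], scriptSource: String): RDD[AnyRef] =
  rdd.mapPartitions { records =>
    val shell  = new GroovyShell()
    val script = shell.parse(scriptSource)   // compile at the far end
    records.map { record =>
      val binding = new Binding()
      binding.setVariable("line", record)    // hypothetical variable name
      script.setBinding(binding)
      script.run()
    }
  }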

Re: converting categorical values in csv file to numerical values

2015-11-05 Thread tog
Hi Bala

Can't you just build a simple dictionary and map those values to numbers?

Cheers
Guillaume

On 5 November 2015 at 09:54, Balachandar R.A. 
wrote:

> HI
>
>
> I am new to Spark MLlib and machine learning. I have a CSV file that
> consists of around 100 thousand rows and 20 columns. Of these 20 columns,
> 10 contain string values. The values in these columns are not necessarily
> unique; they are categorical, that is, the values come from a limited set,
> say 10 values. To start with, I could run the examples, especially the
> random forest algorithm, on my local Spark (1.5.1) platform. However, I
> have a challenge with my dataset because of these strings, as the APIs take
> numerical values. Can anyone tell me how I can map these categorical
> values (strings) to numbers and use them with the random forest algorithm?
> Any example will be greatly appreciated.
>
>
> regards
>
> Bala
>



-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
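
For reference, the "simple dictionary" approach is essentially what Spark ML's
StringIndexer does (available since Spark 1.4). A minimal sketch; the
DataFrame `df` and the column names are placeholders:

import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.DataFrame

// Index each categorical string column into a numeric column; the indexed
// columns can then be flagged as categorical for random forest via
// categoricalFeaturesInfo (or handled with VectorIndexer).
def indexCategoricals(df: DataFrame, cols: Seq[String]): DataFrame =
  cols.foldLeft(df) { (data, c) =>
    new StringIndexer()
      .setInputCol(c)
      .setOutputCol(c + "_idx")
      .fit(data)
      .transform(data)
  }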


Re: converting categorical values in csv file to numerical values

2015-11-05 Thread tog
If your corpus is large (NLP), this is indeed the best solution; otherwise
(few words, i.e. categories) I guess you will end up with the same result.

On Friday, 6 November 2015, Balachandar R.A. <balachandar...@gmail.com>
wrote:

> Hi Guillaume,
>
>
> This is always an option. However, I read about HashingTF, which does
> exactly this quite efficiently and can scale too. Hence, I am looking for
> a solution using that technique.
>
>
> regards
> Bala
>
>
> On 5 November 2015 at 18:50, tog <guillaume.all...@gmail.com> wrote:
>
>> Hi Bala
>>
>> Can't you just build a simple dictionary and map those values to numbers?
>>
>> Cheers
>> Guillaume
>>
>> On 5 November 2015 at 09:54, Balachandar R.A. <balachandar...@gmail.com> wrote:
>>
>>> HI
>>>
>>>
>>> I am new to Spark MLlib and machine learning. I have a CSV file that
>>> consists of around 100 thousand rows and 20 columns. Of these 20 columns,
>>> 10 contain string values. The values in these columns are not necessarily
>>> unique; they are categorical, that is, the values come from a limited set,
>>> say 10 values. To start with, I could run the examples, especially the
>>> random forest algorithm, on my local Spark (1.5.1) platform. However, I
>>> have a challenge with my dataset because of these strings, as the APIs take
>>> numerical values. Can anyone tell me how I can map these categorical
>>> values (strings) to numbers and use them with the random forest algorithm?
>>> Any example will be greatly appreciated.
>>>
>>>
>>> regards
>>>
>>> Bala
>>>
>>
>>
>>
>> --
>> PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
>>
>
>

-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
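
For completeness, a minimal sketch of the HashingTF route mentioned above
(MLlib's hashing trick); the value "red" and the numFeatures setting are
arbitrary choices for the sketch:

import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

// Hash a single categorical value into a fixed-width indicator vector.
// With only ~10 distinct categories this ends up very close to a dictionary
// lookup, as discussed above, but it scales without building the dictionary.
val hashingTF = new HashingTF(numFeatures = 32)
val example: Vector = hashingTF.transform(Seq("red"))   // sparse vector with 1.0 at the hashed index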


Re: Question abt serialization

2015-07-28 Thread tog
Hi Akhil

I have it working now in the Groovy REPL, in a form similar to the one you
are mentioning. Still, I don't understand why the previous form (with the
Function) raises that exception.

Cheers
Guillaume

On 28 July 2015 at 08:56, Akhil Das ak...@sigmoidanalytics.com wrote:

 Did you try it with just: (comment out line 27)

 println "Count of spark: " + file.filter({ s -> s.contains('spark') }).count()

 Thanks
 Best Regards

 On Sun, Jul 26, 2015 at 12:43 AM, tog guillaume.all...@gmail.com wrote:

 Hi

 I have been using Spark for quite some time with either Scala or Python.
 I wanted to give Groovy a try through scripts for small tests.

 Unfortunately I get the following exception (using this simple script:
 https://gist.github.com/galleon/d6540327c418aa8a479f)

 Is there anything I am not doing correctly here?

 Thanks

 tog Groovy4Spark $ groovy GroovySparkWordcount.groovy

 class org.apache.spark.api.java.JavaRDD

 true

 true

 Caught: org.apache.spark.SparkException: Task not serializable

 org.apache.spark.SparkException: Task not serializable

 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)

 at
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)

 at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)

 at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)

 at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)

 at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)

 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)

 at
 org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)

 at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

 at org.apache.spark.rdd.RDD.filter(RDD.scala:310)

 at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)

 at org.apache.spark.api.java.JavaRDD$filter$0.call(Unknown Source)

 at GroovySparkWordcount.run(GroovySparkWordcount.groovy:27)

 Caused by: java.io.NotSerializableException: GroovySparkWordcount

 Serialization stack:

 - object not serializable (class: GroovySparkWordcount, value:
 GroovySparkWordcount@7eee6c13)

 - field (class: GroovySparkWordcount$1, name: this$0, type: class
 GroovySparkWordcount)

 - object (class GroovySparkWordcount$1, GroovySparkWordcount$1@15c16f19)

 - field (class: org.apache.spark.api.java.JavaRDD$$anonfun$filter$1,
 name: f$1, type: interface org.apache.spark.api.java.function.Function)

 - object (class org.apache.spark.api.java.JavaRDD$$anonfun$filter$1,
 function1)

 at
 org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)

 at
 org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)

 at
 org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)

 at
 org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)

 ... 12 more


 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
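
The "previous form" fails for the same reason a Java anonymous Function does:
it captures its enclosing, non-serializable object (the this$0 field visible
in the serialization stack of the original post below). A plain-Scala sketch
of the same failure and the usual fix, with hypothetical class names:

import org.apache.spark.rdd.RDD

class KeywordCounter(keyword: String) {              // not Serializable
  // Fails: reading `keyword` inside the closure pulls in `this`,
  // so Spark tries to serialize the whole KeywordCounter instance.
  def countBad(lines: RDD[String]): Long =
    lines.filter(s => s.contains(keyword)).count()

  // Works: copy the field into a local val so only the String is captured.
  def countGood(lines: RDD[String]): Long = {
    val k = keyword
    lines.filter(s => s.contains(k)).count()
  }
}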


Question abt serialization

2015-07-25 Thread tog
Hi

I have been using Spark for quite some time with either Scala or Python. I
wanted to give Groovy a try through scripts for small tests.

Unfortunately I get the following exception (using this simple script:
https://gist.github.com/galleon/d6540327c418aa8a479f)

Is there anything I am not doing correctly here?

Thanks

tog Groovy4Spark $ groovy GroovySparkWordcount.groovy

class org.apache.spark.api.java.JavaRDD

true

true

Caught: org.apache.spark.SparkException: Task not serializable

org.apache.spark.SparkException: Task not serializable

at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:315)

at
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:305)

at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132)

at org.apache.spark.SparkContext.clean(SparkContext.scala:1893)

at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:311)

at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:310)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:286)

at org.apache.spark.rdd.RDD.filter(RDD.scala:310)

at org.apache.spark.api.java.JavaRDD.filter(JavaRDD.scala:78)

at org.apache.spark.api.java.JavaRDD$filter$0.call(Unknown Source)

at GroovySparkWordcount.run(GroovySparkWordcount.groovy:27)

Caused by: java.io.NotSerializableException: GroovySparkWordcount

Serialization stack:

- object not serializable (class: GroovySparkWordcount, value:
GroovySparkWordcount@7eee6c13)

- field (class: GroovySparkWordcount$1, name: this$0, type: class
GroovySparkWordcount)

- object (class GroovySparkWordcount$1, GroovySparkWordcount$1@15c16f19)

- field (class: org.apache.spark.api.java.JavaRDD$$anonfun$filter$1, name:
f$1, type: interface org.apache.spark.api.java.function.Function)

- object (class org.apache.spark.api.java.JavaRDD$$anonfun$filter$1,
function1)

at
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)

at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)

at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:81)

at
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:312)

... 12 more


-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net


sliding

2015-07-02 Thread tog
Hi

Sorry for this Scala/Spark newbie question. I am creating RDDs which
represent large time series this way:

val data = sc.textFile("somefile.csv")

case class Event(
  time:  Double,
  x:     Double,
  vztot: Double
)

val events = data.filter(s => !s.startsWith("GMT")).map { s =>
  val r = s.split(";")
  ...
  Event(time, x, vztot)
}

I would like to process those RDDs in order to reduce them with some
filtering. For this I noticed that sliding could help, but I have not been
able to use it so far. Here is what I did:

import org.apache.spark.mllib.rdd.RDDFunctions._

val eventsfiltered = events.sliding(3).map(Seq(e0, e1, e2) =>
  Event(e0.time, (e0.x+e1.x+e2.x)/3.0, (e0.vztot+e1.vztot+e2.vztot)/3.0))

Thanks for your help


-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net


Re: sliding

2015-07-02 Thread tog
It was complaining about the Seq ...

I moved it to

val eventsfiltered = events.sliding(3).map(s => Event(s(0).time,
  (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0))

and that is working.

Anyway, this is not what I wanted to do; my goal was more to implement
bucketing to shorten the time series.


On 2 July 2015 at 18:25, Feynman Liang fli...@databricks.com wrote:

 What's the error you are getting?

 On Thu, Jul 2, 2015 at 9:37 AM, tog guillaume.all...@gmail.com wrote:

 Hi

 Sorry for this Scala/Spark newbie question. I am creating RDDs which
 represent large time series this way:

 val data = sc.textFile("somefile.csv")

 case class Event(
   time:  Double,
   x:     Double,
   vztot: Double
 )

 val events = data.filter(s => !s.startsWith("GMT")).map { s =>
   val r = s.split(";")
   ...
   Event(time, x, vztot)
 }

 I would like to process those RDDs in order to reduce them with some
 filtering. For this I noticed that sliding could help, but I have not been
 able to use it so far. Here is what I did:

 import org.apache.spark.mllib.rdd.RDDFunctions._

 val eventsfiltered = events.sliding(3).map(Seq(e0, e1, e2) =>
   Event(e0.time, (e0.x+e1.x+e2.x)/3.0, (e0.vztot+e1.vztot+e2.vztot)/3.0))

 Thanks for your help


 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net


Re: sliding

2015-07-02 Thread tog
Well, it did reduce the length of my series of events. I will have to dig
into what it actually did ;-)

I would assume that it took one value out of every 3, is that correct?
Would it be possible to control a bit more how the value assigned to the
bucket is computed, for example take the first element, the min, the max,
the mean ... any other function?

Thanks for putting me on the right track

On 2 July 2015 at 22:56, Feynman Liang fli...@databricks.com wrote:

 How about:

 events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0)

 That would group the RDD into adjacent buckets of size 3.

 On Thu, Jul 2, 2015 at 2:33 PM, tog guillaume.all...@gmail.com wrote:

 It was complaining about the Seq ...

 I moved it to

 val eventsfiltered = events.sliding(3).map(s => Event(s(0).time,
   (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0))

 and that is working.

 Anyway, this is not what I wanted to do; my goal was more to implement
 bucketing to shorten the time series.


 On 2 July 2015 at 18:25, Feynman Liang fli...@databricks.com wrote:

 What's the error you are getting?

 On Thu, Jul 2, 2015 at 9:37 AM, tog guillaume.all...@gmail.com wrote:

 Hi

 Sorry for this Scala/Spark newbie question. I am creating RDDs which
 represent large time series this way:

 val data = sc.textFile("somefile.csv")

 case class Event(
   time:  Double,
   x:     Double,
   vztot: Double
 )

 val events = data.filter(s => !s.startsWith("GMT")).map { s =>
   val r = s.split(";")
   ...
   Event(time, x, vztot)
 }

 I would like to process those RDDs in order to reduce them with some
 filtering. For this I noticed that sliding could help, but I have not been
 able to use it so far. Here is what I did:

 import org.apache.spark.mllib.rdd.RDDFunctions._

 val eventsfiltered = events.sliding(3).map(Seq(e0, e1, e2) =>
   Event(e0.time, (e0.x+e1.x+e2.x)/3.0, (e0.vztot+e1.vztot+e2.vztot)/3.0))

 Thanks for your help


 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net


Re: sliding

2015-07-02 Thread tog
Understood. Thanks for your great help

Cheers
Guillaume

On 2 July 2015 at 23:23, Feynman Liang fli...@databricks.com wrote:

 Consider an example dataset [a, b, c, d, e, f]

 After sliding(3), you get [(a,b,c), (b,c,d), (c, d, e), (d, e, f)]

 After zipWithIndex: [((a,b,c), 0), ((b,c,d), 1), ((c, d, e), 2), ((d, e,
 f), 3)]

 After filter: [((a,b,c), 0), ((d, e, f), 3)], which is what I'm assuming
 you want (non-overlapping buckets)? You can then do something like
 .map(func(_._1)) to apply func (e.g. min, max, mean) to the 3-tuples.

 On Thu, Jul 2, 2015 at 3:20 PM, tog guillaume.all...@gmail.com wrote:

 Well, it did reduce the length of my series of events. I will have to dig
 into what it actually did ;-)

 I would assume that it took one value out of every 3, is that correct?
 Would it be possible to control a bit more how the value assigned to the
 bucket is computed, for example take the first element, the min, the max,
 the mean ... any other function?

 Thanks for putting me on the right track

 On 2 July 2015 at 22:56, Feynman Liang fli...@databricks.com wrote:

 How about:

 events.sliding(3).zipWithIndex.filter(_._2 % 3 == 0)

 That would group the RDD into adjacent buckets of size 3.

 On Thu, Jul 2, 2015 at 2:33 PM, tog guillaume.all...@gmail.com wrote:

 It was complaining about the Seq ...

 I moved it to

 val eventsfiltered = events.sliding(3).map(s => Event(s(0).time,
   (s(0).x+s(1).x+s(2).x)/3.0, (s(0).vztot+s(1).vztot+s(2).vztot)/3.0))

 and that is working.

 Anyway, this is not what I wanted to do; my goal was more to implement
 bucketing to shorten the time series.


 On 2 July 2015 at 18:25, Feynman Liang fli...@databricks.com wrote:

 What's the error you are getting?

 On Thu, Jul 2, 2015 at 9:37 AM, tog guillaume.all...@gmail.com
 wrote:

 Hi

 Sorry for this Scala/Spark newbie question. I am creating RDDs which
 represent large time series this way:

 val data = sc.textFile("somefile.csv")

 case class Event(
   time:  Double,
   x:     Double,
   vztot: Double
 )

 val events = data.filter(s => !s.startsWith("GMT")).map { s =>
   val r = s.split(";")
   ...
   Event(time, x, vztot)
 }

 I would like to process those RDDs in order to reduce them with some
 filtering. For this I noticed that sliding could help, but I have not been
 able to use it so far. Here is what I did:

 import org.apache.spark.mllib.rdd.RDDFunctions._

 val eventsfiltered = events.sliding(3).map(Seq(e0, e1, e2) =>
   Event(e0.time, (e0.x+e1.x+e2.x)/3.0, (e0.vztot+e1.vztot+e2.vztot)/3.0))

 Thanks for your help


 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net





-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
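
Putting the recipe quoted above together with the averaging from earlier in
the thread, here is a minimal sketch of non-overlapping buckets of 3 reduced
to their mean, reusing the Event case class and the events RDD defined above:

import org.apache.spark.mllib.rdd.RDDFunctions._

// sliding(3) produces every window; keeping only indices divisible by 3
// makes the windows non-overlapping, and the map reduces each bucket to
// its mean. Any other reduction (min, max, first) works the same way.
val bucketed = events.sliding(3)
  .zipWithIndex()
  .filter { case (_, i) => i % 3 == 0 }
  .map { case (w, _) =>
    Event(w(0).time, w.map(_.x).sum / 3.0, w.map(_.vztot).sum / 3.0)
  }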


Re: Time series data

2015-06-29 Thread tog
Hi

Have you tested the Cloudera project
https://github.com/cloudera/spark-timeseries ?
Let me know how you progress on that route, as I am also interested in
that topic.

Cheers



On 26 June 2015 at 14:07, Caio Cesar Trucolo truc...@gmail.com wrote:

 Hi everyone!

 I am working with multiple time series and, in summary, I have to
 adjust each time series (e.g. inserting average values in data gaps) and
 then train a regression model with MLlib for each time series. I did the
 adjustment step by mapping the adjustment function over each element of
 the RDD (each element being the ID as key and the features grouped by
 key). But for the regression models it was not possible, because the
 training functions need RDDs, and my solution would be to map each element
 (grouped as a time series) to a training function. How can I deal with
 time series data in this context with Spark? I didn't find a way.

 Thank you

 --
 Caio




-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
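
One common workaround for the "one model per series" problem above, when each
individual series is small enough to fit on a worker, is to group by series
key and fit a lightweight local model inside mapValues instead of calling
MLlib per series. A minimal sketch, where the (id, (time, value)) layout is an
assumption and a closed-form least-squares line stands in for the real model:

import org.apache.spark.rdd.RDD

// Fit y = slope * t + intercept independently for every series id.
def fitPerSeries(series: RDD[(String, (Double, Double))]): RDD[(String, (Double, Double))] =
  series.groupByKey().mapValues { points =>
    val n   = points.size.toDouble
    val sx  = points.map(_._1).sum
    val sy  = points.map(_._2).sum
    val sxx = points.map(p => p._1 * p._1).sum
    val sxy = points.map(p => p._1 * p._2).sum
    val slope     = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    val intercept = (sy - slope * sx) / n
    (slope, intercept)
  }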


Re: spark and binary files

2015-05-12 Thread tog
Thanks for the answer, but this seems to apply to files that have a
key-value structure, which I currently don't have. My file is a generic
binary file encoding data from sensors over time. I am just looking at
recreating some objects by assigning splits (i.e. contiguous chunks of
bytes) to each computing element (maps?); the outcome of this step will be
an RDD of observations which have been detected and reconstructed from that
file.

On Saturday, May 9, 2015, ayan guha guha.a...@gmail.com wrote:

 Spark uses any InputFormat you specify, and the number of splits equals the
 number of RDD partitions. You may want to take a deeper look at
 SparkContext.newAPIHadoopRDD to load your data.



 On Sat, May 9, 2015 at 4:48 PM, tog guillaume.all...@gmail.com wrote:

 Hi

 I have an application that currently runs using MR. It starts by
 extracting information from a proprietary binary file that is copied to
 HDFS. The application starts by creating business objects from information
 extracted from the binary files. Later those objects are used for further
 processing, again with MR jobs.

 I am planning to move towards Spark, and I clearly see that I could use
 JavaRDD<businessObjects> for parallel processing. However, it is not yet
 obvious what the process would be to generate this RDD from my binary file
 in parallel.

 Today I use parallelism based on the split assigned to each of the map
 elements of a job. Can I mimic such a thing using Spark? All the examples I
 have seen so far use text files, for which I guess the partitions are
 based on a given number of contiguous lines.

 Any help or pointer would be appreciated

 Cheers
 Guillaume


 --
 PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net




 --
 Best Regards,
 Ayan Guha



-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net
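
If the sensor records happen to be fixed-length, SparkContext.binaryRecords
(available since Spark 1.2) already does the split-per-chunk assignment
described above; for variable-length records, a custom Hadoop InputFormat via
newAPIHadoopFile (as suggested earlier) is the way to control the splits. A
minimal sketch for the fixed-length case; the 16-byte record layout is purely
an assumption:

import java.nio.ByteBuffer
import org.apache.spark.SparkContext

case class Observation(time: Long, value: Double)

// Each partition gets a contiguous chunk of whole records, decoded locally.
def loadObservations(sc: SparkContext, path: String) = {
  val recordLength = 16                 // 8-byte timestamp + 8-byte reading (assumption)
  sc.binaryRecords(path, recordLength).map { bytes =>
    val buf = ByteBuffer.wrap(bytes)
    Observation(buf.getLong, buf.getDouble)
  }
}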


spark and binary files

2015-05-09 Thread tog
Hi

I have an application that currently runs using MR. It starts by
extracting information from a proprietary binary file that is copied to
HDFS. The application starts by creating business objects from information
extracted from the binary files. Later those objects are used for further
processing, again with MR jobs.

I am planning to move towards Spark, and I clearly see that I could use
JavaRDD<businessObjects> for parallel processing. However, it is not yet
obvious what the process would be to generate this RDD from my binary file
in parallel.

Today I use parallelism based on the split assigned to each of the map
elements of a job. Can I mimic such a thing using Spark? All the examples I
have seen so far use text files, for which I guess the partitions are
based on a given number of contiguous lines.

Any help or pointer would be appreciated

Cheers
Guillaume


-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net


parallelism on binary file

2015-05-08 Thread tog
Hi

I have an application that currently runs using MR. It starts by
extracting information from a proprietary binary file that is copied to
HDFS. The application starts by creating business objects from information
extracted from the binary files. Later those objects are used for further
processing, again with MR jobs.

I am planning to move towards Spark, and I clearly see that I could use
JavaRDD<businessObjects> for parallel processing. However, it is not yet
obvious what the process would be to generate this RDD from my binary file
in parallel.

Today I use parallelism based on the split assigned to each of the map
elements of a job. Can I mimic such a thing using Spark? All the examples I
have seen so far use text files, for which I guess the partitions are
based on a given number of contiguous lines.

Any help or pointer would be appreciated

Cheers
Guillaume



-- 
PGP KeyID: 2048R/EA31CFC9  subkeys.pgp.net