Re: pyspark 1.4 udf change date values

2015-07-17 Thread Luis Guerra
Sure, I have created JIRA SPARK-9131 - UDF change data values <https://issues.apache.org/jira/browse/SPARK-9131> On Thu, Jul 16, 2015 at 7:09 PM, Davies Liu wrote: > Thanks for reporting this, could you file a JIRA for it? > > On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra >

pyspark 1.4 udf change date values

2015-07-16 Thread Luis Guerra
Hi all, I am having some trouble when using a custom udf in dataframes with pyspark 1.4. I have rewritten the udf to simplify the problem and it gets even weirder. The udfs I am using do absolutely nothing, they just receive some value and output the same value with the same format. I show you
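For reference, a minimal sketch of the kind of pass-through UDF described above (a hypothetical reproduction, not the original poster's code), applying an identity udf to a date column in PySpark 1.4; the reported bug (SPARK-9131) was that the returned values differed from the input:

```python
from pyspark.sql import SQLContext
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
import datetime

sqlContext = SQLContext(sc)  # assumes an existing SparkContext `sc`

df = sqlContext.createDataFrame(
    [(1, datetime.date(2015, 7, 1)), (2, datetime.date(2015, 7, 2))],
    ["id", "d"])

# A udf that does absolutely nothing: returns its input with the same type
identity = udf(lambda x: x, DateType())

df.select("d", identity(df["d"]).alias("d2")).show()
# Expected: columns d and d2 are identical; the bug report was that they were not.
```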

Apply function to all elements along each key

2015-01-20 Thread Luis Guerra
Hi all, I would like to apply a function over all elements for each key (assuming key-value RDD). For instance, imagine I have: import numpy as np a = np.array([[1, 'hola', 'adios'],[2, 'hi', 'bye'],[2, 'hello', 'goodbye']]) a = sc.parallelize(a) Then I want to create a key-value RDD, using the
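A minimal sketch of one common approach to this (groupByKey plus mapValues, with the numpy rows replaced by plain Python lists and a made-up per-group function):

```python
# Build (key, value) pairs from the rows, then apply a function to all
# values belonging to each key.
rows = [[1, 'hola', 'adios'], [2, 'hi', 'bye'], [2, 'hello', 'goodbye']]
rdd = sc.parallelize(rows)                       # assumes an existing SparkContext `sc`

pairs = rdd.map(lambda row: (row[0], row[1:]))   # key = first column, value = the rest
grouped = pairs.groupByKey().mapValues(list)     # key -> list of values for that key

# Apply some function over all elements of each key, e.g. join the words:
result = grouped.mapValues(lambda vals: [" ".join(v) for v in vals])
print(result.collect())
# [(1, ['hola adios']), (2, ['hi bye', 'hello goodbye'])]
```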

Re: Spark executors resources. Blocking?

2015-01-13 Thread Luis Guerra
r Yarn or Mesos to schedule the system. > The same issues will come up, but they have a much broader range of > approaches that you can take to solve the problem. > > > > Dave > > > > *From:* Luis Guerra [mailto:luispelay...@gmail.com] > *Sent:* Monday, January 12, 201

Spark executors resources. Blocking?

2015-01-12 Thread Luis Guerra
Hello all, I have a naive question regarding how Spark uses the executors in a cluster of machines. Imagine the scenario in which I do not know the input size of my data in execution A, so I set Spark to use 20 (out of my 25 nodes, for instance). At the same time, I also launch a second execution
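A minimal sketch (standalone mode, hypothetical values) of capping the resources one application may claim so that a concurrently submitted job is not left waiting for executors:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("execution_A")
        .set("spark.cores.max", "20")          # cap the total cores this app can take
        .set("spark.executor.memory", "4g"))   # per-executor memory
sc = SparkContext(conf=conf)

# On YARN the equivalent knobs are --num-executors / --executor-cores on
# spark-submit, or dynamic allocation (spark.dynamicAllocation.enabled).
```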

"Ungroup" data

2014-09-25 Thread Luis Guerra
Hi everyone, I need some advice about how to do the following: having an RDD of vectors (each vector being Vector(Int, Int, Int, Int)), I need to group the data, then I need to apply a function to every group comparing each consecutive item within a group and retaining a variable (that has to be
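A minimal sketch with made-up data of one way to do this in PySpark: group by key, apply a function that walks consecutive items while carrying a running variable, then flatMap back to one record per element ("ungrouping"):

```python
rows = [(1, 10), (1, 12), (1, 11), (2, 5), (2, 7)]   # hypothetical (key, value) pairs
rdd = sc.parallelize(rows)                           # assumes an existing SparkContext `sc`

def compare_consecutive(values):
    """Compare each consecutive pair, carrying the previous value along."""
    values = sorted(values)
    out, prev = [], None
    for v in values:
        out.append(v - prev if prev is not None else 0)  # e.g. running difference
        prev = v
    return out

result = (rdd.groupByKey()
             .mapValues(compare_consecutive)                      # per-group function
             .flatMap(lambda kv: [(kv[0], v) for v in kv[1]]))    # "ungroup" again
print(result.collect())
```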

Time difference between Python and Scala

2014-09-19 Thread Luis Guerra
Hello everyone, What should be the normal time difference between Scala and Python using Spark? I mean running the same program in the same cluster environment. In my case I am using numpy array structures for the Python code and vectors for the Scala code, both for handling my data. The time dif

Re: Number of partitions when saving (pyspark)

2014-09-17 Thread Luis Guerra
is carried out only in 4 stages. What am I doing wrong? On Wed, Sep 17, 2014 at 6:20 PM, Davies Liu wrote: > On Wed, Sep 17, 2014 at 5:21 AM, Luis Guerra > wrote: > > Hi everyone, > > > > Is it possible to fix the number of tasks related to a saveAsTextFile in > &

Number of partitions when saving (pyspark)

2014-09-17 Thread Luis Guerra
Hi everyone, Is it possible to fix the number of tasks related to a saveAsTextFile in Pyspark? I am loading several files from HDFS, fixing the number of partitions to X (let's say 40, for instance). Then some transformations, like joins and filters, are carried out. The weird thing here is that th
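A minimal sketch (hypothetical paths) of pinning the number of output files: repartition (or coalesce) immediately before saveAsTextFile, since joins and filters can change the partition count along the way:

```python
rdd = sc.textFile("hdfs:///input/*", minPartitions=40)   # assumes an existing SparkContext `sc`

transformed = rdd.filter(lambda line: line)               # ... joins, filters, etc.

# Force exactly 40 partitions, hence 40 output part-files:
transformed.repartition(40).saveAsTextFile("hdfs:///output/result")
# coalesce(40) avoids a full shuffle when only reducing the partition count.
```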

Re: Spark execution plan

2014-07-23 Thread Luis Guerra
Thanks for your answer. However, there has been a misunderstanding here. My question is related to controlling the execution in parallel of different parts of code, similarly to PIG, where there is a planning phase before the execution. On Wed, Jul 23, 2014 at 1:46 PM, chutium wrote: > it seems un

Spark execution plan

2014-07-23 Thread Luis Guerra
Hi all, I was wondering how Spark may deal with an execution plan. Using PIG as an example and its DAG execution, I would like to manage Spark for a similar solution. For instance, if my code has 3 different "parts", being A and B self-sufficient parts: Part A: .. . . var output_a Part
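Spark does not expose a Pig-style plan you can schedule up front, but independent actions can be submitted from separate driver threads and the scheduler will run the resulting jobs concurrently. A minimal sketch with made-up jobs for parts A and B:

```python
import threading

def part_a():
    # hypothetical self-sufficient part A
    return sc.parallelize(range(100)).map(lambda x: x * 2).sum()

def part_b():
    # hypothetical self-sufficient part B
    return sc.parallelize(range(100)).filter(lambda x: x % 3 == 0).count()

results = {}
threads = [threading.Thread(target=lambda: results.update(a=part_a())),
           threading.Thread(target=lambda: results.update(b=part_b()))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Part C can now combine results["a"] and results["b"]
print(results)
```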

Re: class after join

2014-07-17 Thread Luis Guerra
compared to > making many of these values classes. > > (Although, if you otherwise needed a class that represented "all of > the things in class A and class B", this could be done easily with > composition, a class with an A and a B inside.) > > On Thu, Jul 17, 2014 at 9:1

class after join

2014-07-17 Thread Luis Guerra
Hi all, I am a newbie Spark user with many doubts, so sorry if this is a "silly" question. I am dealing with tabular data formatted as text files, so when I first load the data, my code is like this: case class data_class( V1: String, V2: String, V3: String, V4: String, V5: String