Thanks Saurabh!

That explode function looks like exactly what I need.

We will be using MLlib quite a lot. Do I have to worry about Python
versions for that?

John

On Wed, Jun 22, 2016 at 4:34 PM, Saurabh Sardeshpande <saurabh...@gmail.com>
wrote:

> Hi John,
>
> If you can do it in Hive, you should be able to do it in Spark. Just make
> sure you import HiveContext instead of SQLContext.
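>
> For example (assuming you already have a SparkContext called sc):
>
>     from pyspark.sql import HiveContext
>
>     sqlContext = HiveContext(sc)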
>
> If your intent is to explore rather than just get stuff done, I'm not aware of
> any RDD operations that do this for you, but there is a DataFrame function
> called 'explode' which does this -
> https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.functions.explode.
> You'll just have to generate the array of dates using something like this -
> http://stackoverflow.com/questions/7274267/print-all-day-dates-between-two-dates
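>
> A rough sketch of that approach (untested, and assuming a DataFrame called
> df with string columns id, start_date and end_date in 'YYYY-MM-DD' form):
>
>     from datetime import datetime, timedelta
>
>     from pyspark.sql.functions import explode, udf
>     from pyspark.sql.types import ArrayType, StringType
>
>     def date_range(start, end):
>         # Every day from start to end, inclusive, as 'YYYY-MM-DD' strings.
>         start_dt = datetime.strptime(start, "%Y-%m-%d")
>         end_dt = datetime.strptime(end, "%Y-%m-%d")
>         n_days = (end_dt - start_dt).days
>         return [(start_dt + timedelta(d)).strftime("%Y-%m-%d")
>                 for d in range(n_days + 1)]
>
>     date_range_udf = udf(date_range, ArrayType(StringType()))
>
>     # explode() turns the array of dates into one output row per date.
>     dates = explode(date_range_udf("start_date", "end_date")).alias("date")
>     exploded = df.select("id", dates)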
>
> It's generally recommended to use Python 3 if you're starting a new
> project and don't have old dependencies. But remember that there is still
> quite a lot of stuff that is not yet ported to Python 3.
>
> Regards,
> Saurabh
>
> On Wed, Jun 22, 2016 at 3:20 PM, John Aherne <john.ahe...@justenough.com>
> wrote:
>
>> Hi Everyone,
>>
>> I am pretty new to Spark (and the mailing list), so forgive me if the
>> answer is obvious.
>>
>> I have a dataset, and each row contains a start date and end date.
>>
>> I would like to explode each row so that each day between the start and
>> end dates becomes its own row.
>> e.g.
>> row1  2015-01-01  2015-01-03
>> becomes
>> row1   2015-01-01
>> row1   2015-01-02
>> row1   2015-01-03
>>
>> So, my questions are:
>> Is Spark a good place to do that?
>> I can do it in Hive, but it's a bit messy, and this seems like a good
>> problem to use for learning Spark (and Python).
>>
>> If so, any pointers on what methods I should use? Particularly how to
>> split one row into multiple rows.
>>
>> Lastly, I am a bit hesitant to ask, but is there a recommendation on which
>> version of Python to use? I am not interested in which is better; I just
>> want to know whether both are supported equally.
>>
>> I am using Spark 1.6.1 (Hortonworks distro).
>>
>> Thanks!
>> John
>>
>> --
>>
>> John Aherne
>> Big Data and SQL Developer
>>
>


-- 

John Aherne
Big Data and SQL Developer

[image: JustEnough Logo]

Cell: +1 (303) 809-9718
Email: john.ahe...@justenough.com
Skype: john.aherne.je
Web: www.justenough.com

