Fwd:

2015-04-02 Thread Himanish Kushary
Actually, they may not be sequentially generated, and the list (RDD) could
also come from a different component.

For example from this RDD :

(105,918)
(105,757)
(502,516)
(105,137)
(516,816)
(350,502)

I would like to separate it into two RDDs:

1) (105,918)
   (502,516)

2) (105,757)
   (105,137)
   (516,816)
   (350,502)

Right now I am using a mutable Set variable to track the elements already
selected. After coalescing the RDD to a single partition, I am doing
something like:

// Tracks strings already claimed by an accepted pair.
val evalCombinations = collection.mutable.Set.empty[String]

val currentValidCombinations = allCombinations
  .filter { p =>
    if (!evalCombinations.contains(p._1) && !evalCombinations.contains(p._2)) {
      evalCombinations += p._1
      evalCombinations += p._2
      true
    } else {
      false
    }
  }
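
The leftover pairs (the second group above) then form the second RDD; a
rough, untested sketch, assuming allCombinations holds no duplicate pairs
(subtract compares whole elements):

// Everything not selected above becomes the second RDD.
val remainingCombinations = allCombinations.subtract(currentValidCombinations)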

This approach is limited by the memory of the executor it runs on. I would
appreciate any better, more scalable solution.

Thanks



On Wed, Mar 25, 2015 at 3:13 PM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:

> You're generating all possible pairs?
>
> In that case, why not just generate the sequential pairs you want from the
> start?
>
> On Wed, Mar 25, 2015 at 3:11 PM, Himanish Kushary 
> wrote:
>
>> It will only give (A,B). I am generating the pairs from combinations of
>> the strings A, B, C and D, so the pairs (ignoring order) would be
>>
>> (A,B),(A,C),(A,D),(B,C),(B,D),(C,D)
>>
>> On successful filtering using the original condition, it will transform
>> to (A,B) and (C,D).
>>
>> On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <
>> nkronenfeld@uncharted.software> wrote:
>>
>>> What would it do with the following dataset?
>>>
>>> (A, B)
>>> (A, C)
>>> (B, D)
>>>
>>>
>>> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have an RDD of pairs of strings like the one below:
>>>>
>>>> (A,B)
>>>> (B,C)
>>>> (C,D)
>>>> (A,D)
>>>> (E,F)
>>>> (B,F)
>>>>
>>>> I need to transform/filter this into an RDD of pairs in which no
>>>> string is repeated once it has been used. So something like:
>>>>
>>>> (A,B)
>>>> (C,D)
>>>> (E,F)
>>>>
>>>> (B,C) is out because B has already been used in (A,B), (A,D) is out
>>>> because A (and D) have already been used, etc.
>>>>
>>>> I was thinking of the option of using a shared variable to keep track of
>>>> what has already been used, but that may only work for a single partition
>>>> and would not scale for larger datasets.
>>>>
>>>> Is there any other efficient way to accomplish this?
>>>>
>>>> --
>>>> Thanks & Regards
>>>> Himanish
>>>>
>>>
>>>
>>
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish





Re:

2015-03-25 Thread Himanish Kushary
It will only give (A,B). I am generating the pairs from combinations of the
strings A, B, C and D, so the pairs (ignoring order) would be

(A,B),(A,C),(A,D),(B,C),(B,D),(C,D)

On successful filtering using the original condition, it will transform to
(A,B) and (C,D).

On Wed, Mar 25, 2015 at 3:00 PM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:

> What would it do with the following dataset?
>
> (A, B)
> (A, C)
> (B, D)
>
>
> On Wed, Mar 25, 2015 at 1:02 PM, Himanish Kushary 
> wrote:
>
>> Hi,
>>
>> I have an RDD of pairs of strings like the one below:
>>
>> (A,B)
>> (B,C)
>> (C,D)
>> (A,D)
>> (E,F)
>> (B,F)
>>
>> I need to transform/filter this into an RDD of pairs in which no string
>> is repeated once it has been used. So something like:
>>
>> (A,B)
>> (C,D)
>> (E,F)
>>
>> (B,C) is out because B has already been used in (A,B), (A,D) is out
>> because A (and D) have already been used, etc.
>>
>> I was thinking of the option of using a shared variable to keep track of
>> what has already been used, but that may only work for a single partition
>> and would not scale for larger datasets.
>>
>> Is there any other efficient way to accomplish this?
>>
>> --
>> Thanks & Regards
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish


[no subject]

2015-03-25 Thread Himanish Kushary
Hi,

I have an RDD of pairs of strings like the one below:

(A,B)
(B,C)
(C,D)
(A,D)
(E,F)
(B,F)

I need to transform/filter this into an RDD of pairs in which no string is
repeated once it has been used. So something like:

(A,B)
(C,D)
(E,F)

(B,C) is out because B has already been used in (A,B), (A,D) is out because
A (and D) have already been used, etc.

I was thinking of the option of using a shared variable to keep track of
what has already been used, but that may only work for a single partition
and would not scale for larger datasets.

Is there any other efficient way to accomplish this?

-- 
Thanks & Regards
Himanish


Re: Tools to manage workflows on Spark

2015-03-01 Thread Himanish Kushary
We are running our Spark jobs on Amazon AWS and are using AWS Data Pipeline
for orchestration of the different Spark jobs. AWS Data Pipeline provides
automatic EMR cluster provisioning, retry on failure, SNS notifications, etc.
out of the box and works well for us.





On Sun, Mar 1, 2015 at 7:02 PM, Felix C  wrote:

> We use Oozie as well, and it has worked well.
> The catch is that each action in Oozie is separate, and one cannot retain
> the SparkContext or RDDs, or leverage caching or temp tables, going into
> another Oozie action. You could either save output to a file or put all
> Spark processing into one Oozie action.
>
> --- Original Message ---
>
> From: "Mayur Rustagi" 
> Sent: February 28, 2015 7:07 PM
> To: "Qiang Cao" 
> Cc: "Ted Yu" , "Ashish Nigam" ,
> "user" 
> Subject: Re: Tools to manage workflows on Spark
>
> Sorry, not really. Spork is a way to migrate your existing Pig scripts to
> Spark or to write new Pig jobs that can then execute on Spark.
> For orchestration you are better off using Oozie, especially if you are
> using other execution engines/systems besides Spark.
>
>
> Regards,
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoid.com 
> @mayur_rustagi 
>
> On Sat, Feb 28, 2015 at 6:59 PM, Qiang Cao  wrote:
>
> Thanks Mayur! I'm looking for something that would allow me to easily
> describe and manage a workflow on Spark. A workflow in my context is a
> composition of Spark applications that may depend on one another based on
> HDFS inputs/outputs. Is Spork a good fit? The orchestration I want is at
> the app level.
>
>
>
> On Sat, Feb 28, 2015 at 9:38 PM, Mayur Rustagi 
> wrote:
>
> We do maintain it, but in the Apache repo itself. However, Pig cannot do
> orchestration for you. I am not sure what you are looking for from Pig in
> this context.
>
> Regards,
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoid.com 
>  @mayur_rustagi 
>
> On Sat, Feb 28, 2015 at 6:36 PM, Ted Yu  wrote:
>
> Here was the latest modification in the Spork repo:
> Mon Dec 1 10:08:19 2014
>
> Not sure if it is being actively maintained.
>
> On Sat, Feb 28, 2015 at 6:26 PM, Qiang Cao  wrote:
>
> Thanks for the pointer, Ashish! I was also looking at Spork
> (https://github.com/sigmoidanalytics/spork, Pig-on-Spark), but wasn't sure
> if that's the right direction.
>
> On Sat, Feb 28, 2015 at 6:36 PM, Ashish Nigam 
> wrote:
>
> You have to call spark-submit from Oozie.
> I used this link to get the idea for my implementation:
>
>
> http://mail-archives.apache.org/mod_mbox/oozie-user/201404.mbox/%3CCAHCsPn-0Grq1rSXrAZu35yy_i4T=fvovdox2ugpcuhkwmjp...@mail.gmail.com%3E
>
>
>
>  On Feb 28, 2015, at 3:25 PM, Qiang Cao  wrote:
>
> Thanks, Ashish! Is Oozie integrated with Spark? I know it can
> accommodate some Hadoop jobs.
>
>
> On Sat, Feb 28, 2015 at 6:07 PM, Ashish Nigam 
> wrote:
>
> Qiang,
> Did you look at Oozie?
> We use Oozie to run Spark jobs in production.
>
>
>  On Feb 28, 2015, at 2:45 PM, Qiang Cao  wrote:
>
>  Hi Everyone,
>
> We need to deal with workflows on Spark. In our scenario, each workflow
> consists of multiple processing steps, and there could be dependencies
> among the different steps. I'm wondering if there are tools available that
> can help us schedule and manage workflows on Spark. I'm looking for
> something like Pig on Hadoop, but it should fully function on Spark.
>
>  Any suggestion?
>
>  Thanks in advance!
>
>  Qiang
>
>
>
>
>
>
>
>
>
>


-- 
Thanks & Regards
Himanish


Re: High CPU usage in Driver

2015-02-27 Thread Himanish Kushary
Hi,

I was able to solve the issue. Putting down the settings that worked for me.

1) It was happening due to the large number of partitions. I coalesce'd
the RDD as early as possible in my code into far fewer partitions (used
coalesce(1) to bring it down from 500K to 10K).

2) Increased the settings for the parameters spark.akka.frameSize (= 500),
spark.akka.timeout, spark.akka.askTimeout and
spark.core.connection.ack.wait.timeout to get rid of any insufficient
frame size and timeout errors.
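
For a standalone app (rather than spark-shell), a rough sketch of applying
the same settings through SparkConf might look like the lines below; the
frame size is the value above, while the timeout values are placeholders to
tune for your workload:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: Spark 1.x configuration keys referenced in this thread.
// Timeout values below are illustrative, not the ones used here.
val conf = new SparkConf()
  .setAppName("distinct-on-large-rdd")
  .set("spark.akka.frameSize", "500")                   // MB
  .set("spark.akka.timeout", "300")                     // seconds (placeholder)
  .set("spark.akka.askTimeout", "300")                  // seconds (placeholder)
  .set("spark.core.connection.ack.wait.timeout", "600") // seconds (placeholder)
val sc = new SparkContext(conf)

(With spark-shell, the same keys can usually be supplied on the command line
as --conf key=value.)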

Thanks
Himanish

On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary 
wrote:

> Hi,
>
> I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is loaded
> into memory, and its size is around 18G.
>
> Whenever I run a distinct() on the RDD, the driver (spark-shell in
> yarn-client mode) host CPU usage rockets up (400+ %) and the distinct()
> process seems to stall. The Spark driver UI also hangs.
>
> In Ganglia the only node with high load is the driver host. I have tried
> repartitioning the data into a smaller number of partitions (using coalesce
> or repartition) with no luck.
>
> I have attached the jstack output, which shows a few threads in BLOCKED
> status. Not sure what exactly is going on here.
>
> The driver program was started with 15G memory on AWS EMR. Appreciate any
> thoughts regarding the issue.
>
> --
> Thanks & Regards
> Himanish
>



-- 
Thanks & Regards
Himanish


Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
Hello Imran,

Thanks for your response. I noticed the "intersection" and "subtract"
methods on an RDD; do they work based on a hash of all the fields in an
RDD record?
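
For reference, a rough, untested sketch of the keyed approach suggested
below (a join, or subtractByKey for the complement, from PairRDDFunctions),
assuming the first comma-separated field is the key and reusing the variable
names from the original post:

// Key both RDDs by their first CSV field instead of broadcasting the key set.
// This shuffles both RDDs but avoids collecting the keys on the driver.
// (On older Spark, import org.apache.spark.SparkContext._ for the pair functions.)
val rdd1Keys  = rdd1.map(_.split(",")(0)).distinct().map(k => (k, ()))
val rdd2Keyed = rdd2.map(line => (line.split(",")(0), line))

// Keep only RDD-2 records whose key also appears in RDD-1.
val rdd2Filtered = rdd2Keyed.join(rdd1Keys).map { case (_, (line, _)) => line }

// Or drop them instead (the complement) with subtractByKey.
val rdd2Dropped = rdd2Keyed.subtractByKey(rdd1Keys).values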

- Himanish

On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid  wrote:

> the more scalable alternative is to do a join (or a variant like cogroup,
> leftOuterJoin, subtractByKey etc. found in PairRDDFunctions)
>
> the downside is this requires a shuffle of both your RDDs
>
> On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary 
> wrote:
>
>> Hi,
>>
>> I have two RDDs with CSV data as below:
>>
>> RDD-1
>>
>> 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
>> 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
>> 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
>> 101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603
>> 101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb42,19229261643
>> 101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,290542385
>> 101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,922926164
>>
>> RDD-2
>>
>> 101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160
>> 101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,2954238
>> 101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9226164
>> 101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,92292164
>> 101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,9226164
>>
>> 101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164
>> 101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb39,26164
>>
>> I need to filter RDD-2 to include only those records where the first
>> column value in RDD-2 matches any of the first column values in RDD-1
>>
>> Currently, I am broadcasting the first column values from RDD-1 as a
>> list and then filtering RDD-2 based on that list.
>>
>> val rdd1broadcast = sc.broadcast(rdd1.map { uu => uu.split(",")(0) 
>> }.collect().toSet)
>>
>> val rdd2filtered = rdd2.filter{ h => 
>> rdd1broadcast.value.contains(h.split(",")(0)) }
>>
>> This will result in the records with first column "101970_5854301838" (the
>> last two records) being filtered out of RDD-2.
>>
>> Is this the best way to accomplish this? I am worried that for large
>> data volumes, the broadcast step may become an issue. Appreciate any other
>> suggestions.
>>
>> ---
>> Thanks
>> Himanish
>>
>
>


-- 
Thanks & Regards
Himanish


Filter data from one RDD based on data from another RDD

2015-02-19 Thread Himanish Kushary
Hi,

I have two RDDs with CSV data as below:

RDD-1

101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603
101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb42,19229261643
101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,290542385
101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,922926164

RDD-2

101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160
101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,2954238
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9226164
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,92292164
101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,9226164

101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164
101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb39,26164

I need to filter RDD-2 to include only those records where the first column
value in RDD-2 matches any of the first column values in RDD-1

Currently, I am broadcasting the first column values from RDD-1 as a list
and then filtering RDD-2 based on that list.

val rdd1broadcast = sc.broadcast(
  rdd1.map { uu => uu.split(",")(0) }.collect().toSet)

val rdd2filtered = rdd2.filter { h =>
  rdd1broadcast.value.contains(h.split(",")(0)) }

This will result in the records with first column "101970_5854301838" (the
last two records) being filtered out of RDD-2.

Is this the best way to accomplish this? I am worried that for large data
volumes, the broadcast step may become an issue. Appreciate any other
suggestions.

---
Thanks
Himanish