Re: Pig on Spark

Mayur Rustagi Wed, 23 Apr 2014 14:40:26 -0700

UDFs are not working right now, but they are near the top of the list, so
you should be able to use them soon :)
Is there any other Pig functionality you use often, apart from the usual
suspects?


Existing Java MR jobs would be an easier move. Are these cascading jobs or
single MapReduce jobs? If single, you should be able to write Scala wrapper
code that calls your map & reduce functions with some magic & leave your
core code as it is. It would be interesting to see an actual example & get
it to work.
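
A minimal sketch of that wrapper idea, for illustration only. It is plain
Java rather than Scala and uses no Spark or Hadoop classes, so it runs on its
own. The names existingMap, existingReduce and MapReduceWrapper are
hypothetical stand-ins for your job's code, and the in-memory grouping step
stands in for the shuffle that Spark would perform between the two calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical example: none of these names come from Hadoop, Spark or Pig.
class MapReduceWrapper {

    // Stand-in for the existing "map" logic: emit (word, 1) for each token.
    static void existingMap(String line, BiConsumer<String, Integer> emit) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emit.accept(word, 1);
            }
        }
    }

    // Stand-in for the existing "reduce" logic: sum the values of one key.
    static int existingReduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // The wrapper: map every record, group the emitted pairs by key
    // (the role of the shuffle), then reduce each group. The core
    // map and reduce code above is left untouched.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            existingMap(line, (k, v) ->
                grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
        }
        Map<String, Integer> result = new TreeMap<>();
        grouped.forEach((k, vs) -> result.put(k, existingReduce(k, vs)));
        return result;
    }

    public static void main(String[] args) {
        // prints {on=2, pig=1, spark=2, yarn=1}
        System.out.println(run(List.of("pig on spark", "spark on yarn")));
    }
}
```

On Spark the same three steps would presumably become a flat-map over the
input, a group-by-key, and a map over the grouped values, with only the thin
wrapper changing.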

Regards
Mayur


Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj <suman....@gmail.com>wrote:

> We are currently converting Pig and Java MapReduce jobs to Spark jobs, and
> we have written a couple of Pig UDFs as well. Hence I was checking whether
> we can leverage Spork without converting them to Spark jobs.
>
> And is there any way I can port my existing Java MR jobs to Spark?
> I know this thread has a different subject; let me know if I need to ask
> this question in a separate thread.
>
> Thanks in advance.
>
>
> On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi <mayur.rust...@gmail.com>wrote:
>
>> UDF
>> Generate
>> & many many more are not working :)
>>
>> Several of them do work: joins, filters, group by, etc.
>> I am translating the ones we need and would be happy to get help on the
>> others. I will open a JIRA to track them if you are interested.
>>
>>
>> On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <suman....@gmail.com>wrote:
>>
>>> Are all the features available in Pig working in Spork? For example,
>>> UDFs?
>>>
>>> Thanks.
>>>
>>>
>>> On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi 
>>> <mayur.rust...@gmail.com>wrote:
>>>
>>>> There are two benefits I get as of now:
>>>> 1. A lot of customers don't want the full power; they want something
>>>> dead simple, a DSL they can work in. They end up using Hive for a lot of
>>>> ETL just because it is SQL & they understand it. Pig is close: it
>>>> abstracts a lot of framework-level semantics away from the user & lets
>>>> them focus on the data flow.
>>>> 2. Some have codebases in Pig already & are just looking to run them
>>>> faster. I have yet to benchmark that for Pig on Spark.
>>>>
>>>> I agree that Pig on Spark cannot solve a lot of problems, but it can
>>>> solve some without forcing the end customer to do anything even close to
>>>> coding. I believe there is quite some value in making Spark accessible to
>>>> a larger audience.
>>>> At the end of the day, to each his own :)
>>>>
>>>> Regards
>>>> Mayur
>>>>
>>>>
>>>> On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <
>>>> mundlap...@gmail.com> wrote:
>>>>
>>>>> This seems like an interesting question.
>>>>>
>>>>> I love Apache Pig. It is so natural and the language flows with nice
>>>>> syntax.
>>>>>
>>>>> While I was at Yahoo! in core Hadoop Engineering, I used Pig a lot for
>>>>> analytics and gave the Pig team feedback asking for much more
>>>>> functionality when it was at version 0.7. A lot of new functionality has
>>>>> been added since.
>>>>> At the end of the day, Pig is a DSL for data flows. There will always be
>>>>> gaps and enhancements. I have often wondered: is a DSL the right way to
>>>>> solve data flow problems? Maybe not; we need a complete language. We may
>>>>> have found the answer: Scala. With Scala's dynamic compilation, we can
>>>>> write much more powerful constructs than any DSL can provide.
>>>>>
>>>>> If I am a new organization and beginning to choose, I would go with
>>>>> Scala.
>>>>>
>>>>> Here is the example:
>>>>>
>>>>> #!/bin/sh
>>>>> exec scala "$0" "$@"
>>>>> !#
>>>>> YOUR DSL GOES HERE BUT IN SCALA!
>>>>>
>>>>> You get DSL-like scripting, functional programming, and the power of a
>>>>> complete language! If we can improve the first 3 lines, there you go:
>>>>> you have the most powerful DSL for solving data problems.
>>>>>
>>>>> -Bharath
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com>wrote:
>>>>>
>>>>>> Hi Sameer,
>>>>>>
>>>>>> Lin (cc'ed) could also give you some updates about Pig on Spark
>>>>>> development on her side.
>>>>>>
>>>>>> Best,
>>>>>> Xiangrui
>>>>>>
>>>>>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com>
>>>>>> wrote:
>>>>>> > Hi Mayur,
>>>>>> > We are planning to upgrade our distribution from MR1 to MR2 (YARN),
>>>>>> > and the goal is to get Spork set up next month. I will keep you
>>>>>> > posted. Can you please keep me informed about your progress as well?
>>>>>> >
>>>>>> > ________________________________
>>>>>> > From: mayur.rust...@gmail.com
>>>>>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>>>>>> >
>>>>>> > Subject: Re: Pig on Spark
>>>>>> > To: user@spark.apache.org
>>>>>> >
>>>>>> >
>>>>>> > Hi Sameer,
>>>>>> > Did you make any progress on this? My team is also trying it out and
>>>>>> > would love to know some details of your progress.
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > Hi Aniket,
>>>>>> > Many thanks! I will check this out.
>>>>>> >
>>>>>> > ________________________________
>>>>>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>>>>>> > Subject: Re: Pig on Spark
>>>>>> > From: aniket...@gmail.com
>>>>>> > To: user@spark.apache.org; tgraves...@yahoo.com
>>>>>> >
>>>>>> >
>>>>>> > There is some work to make this work on yarn at
>>>>>> > https://github.com/aniket486/pig. (So, compile pig with ant
>>>>>> > -Dhadoopversion=23)
>>>>>> >
>>>>>> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark
>>>>>> > to find out what sort of env variables you need (sorry, I haven't been
>>>>>> > able to clean this up yet; it is in progress). There are a few known
>>>>>> > issues with this; I will work on fixing them soon.
>>>>>> >
>>>>>> > Known issues-
>>>>>> > 1. Limit does not work (spork-fix)
>>>>>> > 2. Foreach requires turning off schema-tuple-backend (should be a
>>>>>> >    pig-jira)
>>>>>> > 3. Algebraic UDFs don't work (spork-fix in-progress)
>>>>>> > 4. Group by rework (to avoid OOMs)
>>>>>> > 5. UDF classloader issue (requires SPARK-1053, then you can put
>>>>>> >    pig-withouthadoop.jar as SPARK_JARS in SparkContext along with udf
>>>>>> >    jars)
>>>>>> >
>>>>>> > ~Aniket
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > I had asked a similar question on the dev mailing list a while back
>>>>>> > (Jan 22nd).
>>>>>> >
>>>>>> > See the archives:
>>>>>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser
>>>>>> > (look for spork).
>>>>>> >
>>>>>> > Basically Matei said:
>>>>>> >
>>>>>> > Yup, that was it, though I believe people at Twitter picked it up
>>>>>> > again recently. I'd suggest asking Dmitriy if you know him. I've seen
>>>>>> > interest in this from several other groups, and if there's enough of
>>>>>> > it, maybe we can start another open source repo to track it. The work
>>>>>> > in that repo you pointed to was done over one week, and already had
>>>>>> > most of Pig's operators working. (I helped out with this prototype
>>>>>> > over Twitter's hack week.) That work also calls the Scala API
>>>>>> > directly, because it was done before we had a Java API; it should be
>>>>>> > easier with the Java one.
>>>>>> >
>>>>>> >
>>>>>> > Tom
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com>
>>>>>> wrote:
>>>>>> > Hi everyone,
>>>>>> >
>>>>>> > We are using Pig to build our data pipeline. I came across Spork --
>>>>>> > Pig on Spark at https://github.com/dvryaboy/pig and am not sure if it
>>>>>> > is still active.
>>>>>> >
>>>>>> > Can someone please let me know the status of Spork or any other
>>>>>> > effort that will let us run Pig on Spark? We can benefit significantly
>>>>>> > by using Spark, but we would like to keep using the existing Pig
>>>>>> > scripts.
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > "...:::Aniket:::... Quetzalco@tl"
>>>>>> >
>>>>>> >
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
