Re: Pig on Spark

suman bharadwaj Thu, 24 Apr 2014 23:00:07 -0700

Hey Mayur,

We use HiveColumnarLoader and XMLLoader. Are these working as well?

Will try a few things regarding porting Java MR.

Regards,
Suman Bharadwaj S


On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:

> Right now UDFs are not working. It's at the top of the list though. You
> should be able to use them soon :)
> Is there any other Pig functionality you use often apart from the usual
> suspects?
>
> Existing Java MR jobs would be an easier move. Are these cascading jobs or
> single map-reduce jobs? If single, then you should be able to write
> Scala wrapper code to call the map & reduce functions with some magic &
> leave your core code as is. It would be interesting to see an actual
> example & get it to work.
>
> Regards
> Mayur
>
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Thu, Apr 24, 2014 at 2:46 AM, suman bharadwaj <suman....@gmail.com> wrote:
>
>> We are currently in the process of converting Pig and Java MapReduce
>> jobs to Spark jobs, and we have written a couple of Pig UDFs as well. Hence
>> I was checking whether we can leverage Spork without converting to Spark jobs.
>>
>> And is there any way I can port my existing Java MR jobs to Spark?
>> I know this thread has a different subject; let me know if I need to ask
>> this question in a separate thread.
>>
>> Thanks in advance.
>>
>>
>> On Thu, Apr 24, 2014 at 2:13 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>
>>> UDF
>>> Generate
>>> & many many more are not working :)
>>>
>>> Several of them work: joins, filters, group by, etc.
>>> I am translating the ones we need and would be happy to get help on others.
>>> Will host a JIRA to track them if you are interested.
>>>
>>>
>>> Mayur Rustagi
>>> Ph: +1 (760) 203 3257
>>> http://www.sigmoidanalytics.com
>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>
>>>
>>>
>>> On Thu, Apr 24, 2014 at 2:10 AM, suman bharadwaj <suman....@gmail.com> wrote:
>>>
>>>> Are all the features available in Pig working in Spork? For example,
>>>> UDFs?
>>>>
>>>> Thanks.
>>>>
>>>>
>>>> On Thu, Apr 24, 2014 at 1:54 AM, Mayur Rustagi <mayur.rust...@gmail.com> wrote:
>>>>
>>>>> There are two benefits I get as of now:
>>>>> 1. Most of the time, a lot of customers don't want the full power; they
>>>>> want something dead simple with which they can work in a DSL. They end up
>>>>> using Hive for a lot of ETL just because it's SQL & they understand it.
>>>>> Pig is close & abstracts a lot of framework-level semantics away from
>>>>> the user & lets them focus on data flow.
>>>>> 2. Some have codebases in Pig already & are just looking to run them
>>>>> faster. I am yet to benchmark that for Pig on Spark.
>>>>>
>>>>> I agree that Pig on Spark cannot solve a lot of problems, but it can solve
>>>>> some without forcing the end customer to do anything even close to coding.
>>>>> I believe there is quite some value in making Spark accessible to a larger
>>>>> audience.
>>>>> At the end of the day, to each his own :)
>>>>>
>>>>> Regards
>>>>> Mayur
>>>>>
>>>>>
>>>>> Mayur Rustagi
>>>>> Ph: +1 (760) 203 3257
>>>>> http://www.sigmoidanalytics.com
>>>>> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 24, 2014 at 1:24 AM, Bharath Mundlapudi <mundlap...@gmail.com> wrote:
>>>>>
>>>>>> This seems like an interesting question.
>>>>>>
>>>>>> I love Apache Pig. It is so natural and the language flows with nice
>>>>>> syntax.
>>>>>>
>>>>>> While I was at Yahoo! in core Hadoop Engineering, I used Pig a
>>>>>> lot for analytics and gave feedback to the Pig team asking for much more
>>>>>> functionality when it was at version 0.7. Lots of new functionality is
>>>>>> offered now.
>>>>>>
>>>>>> At the end of the day, Pig is a DSL for data flows. There will always be
>>>>>> gaps and enhancements. I have often wondered: is a DSL the right way to
>>>>>> solve data flow problems? Maybe not; we need a complete language
>>>>>> construct. We may have found the answer: Scala. With Scala's dynamic
>>>>>> compilation, we can write much more powerful constructs than any DSL
>>>>>> can provide.
>>>>>>
>>>>>> If I were a new organization beginning to choose, I would go with
>>>>>> Scala.
>>>>>>
>>>>>> Here is the example:
>>>>>>
>>>>>> #!/bin/sh
>>>>>> exec scala "$0" "$@"
>>>>>> !#
>>>>>> YOUR DSL GOES HERE BUT IN SCALA!
>>>>>>
>>>>>> You have DSL-like scripting, functional programming, and complete
>>>>>> language power! If we can improve the first 3 lines, there you go: you
>>>>>> have the most powerful DSL to solve data problems.
>>>>>>
>>>>>> -Bharath
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Mar 10, 2014 at 11:00 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Sameer,
>>>>>>>
>>>>>>> Lin (cc'ed) could also give you some updates about Pig on Spark
>>>>>>> development on her side.
>>>>>>>
>>>>>>> Best,
>>>>>>> Xiangrui
>>>>>>>
>>>>>>> On Mon, Mar 10, 2014 at 12:52 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>>>>> > Hi Mayur,
>>>>>>> > We are planning to upgrade our distribution from MR1 to MR2 (YARN), and
>>>>>>> > the goal is to get Spork set up next month. I will keep you posted. Can
>>>>>>> > you please keep me informed about your progress as well?
>>>>>>> >
>>>>>>> > ________________________________
>>>>>>> > From: mayur.rust...@gmail.com
>>>>>>> > Date: Mon, 10 Mar 2014 11:47:56 -0700
>>>>>>> >
>>>>>>> > Subject: Re: Pig on Spark
>>>>>>> > To: user@spark.apache.org
>>>>>>> >
>>>>>>> >
>>>>>>> > Hi Sameer,
>>>>>>> > Did you make any progress on this? My team is also trying it out and
>>>>>>> > would love to know some details of your progress.
>>>>>>> >
>>>>>>> > Mayur Rustagi
>>>>>>> > Ph: +1 (760) 203 3257
>>>>>>> > http://www.sigmoidanalytics.com
>>>>>>> > @mayur_rustagi
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Thu, Mar 6, 2014 at 2:20 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>>>>> >
>>>>>>> > Hi Aniket,
>>>>>>> > Many thanks! I will check this out.
>>>>>>> >
>>>>>>> > ________________________________
>>>>>>> > Date: Thu, 6 Mar 2014 13:46:50 -0800
>>>>>>> > Subject: Re: Pig on Spark
>>>>>>> > From: aniket...@gmail.com
>>>>>>> > To: user@spark.apache.org; tgraves...@yahoo.com
>>>>>>> >
>>>>>>> >
>>>>>>> > There is some work to make this work on YARN at
>>>>>>> > https://github.com/aniket486/pig. (So, compile Pig with ant
>>>>>>> > -Dhadoopversion=23)
>>>>>>> >
>>>>>>> > You can look at https://github.com/aniket486/pig/blob/spork/pig-spark
>>>>>>> > to find out what sort of env variables you need (sorry, I haven't been
>>>>>>> > able to clean this up; it is in progress). There are a few known issues
>>>>>>> > with this; I will work on fixing them soon.
>>>>>>> >
>>>>>>> > Known issues:
>>>>>>> > 1. Limit does not work (spork-fix)
>>>>>>> > 2. Foreach requires turning off schema-tuple-backend (should be a
>>>>>>> > pig-jira)
>>>>>>> > 3. Algebraic UDFs don't work (spork-fix in-progress)
>>>>>>> > 4. Group by rework (to avoid OOMs)
>>>>>>> > 5. UDF Classloader issue (requires SPARK-1053, then you can put
>>>>>>> > pig-withouthadoop.jar as SPARK_JARS in SparkContext along with UDF
>>>>>>> > jars)
>>>>>>> >
>>>>>>> > ~Aniket
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Thu, Mar 6, 2014 at 1:36 PM, Tom Graves <tgraves...@yahoo.com> wrote:
>>>>>>> >
>>>>>>> > I had asked a similar question on the dev mailing list a while back
>>>>>>> > (Jan 22nd).
>>>>>>> >
>>>>>>> > See the archives:
>>>>>>> > http://mail-archives.apache.org/mod_mbox/spark-dev/201401.mbox/browser
>>>>>>> > -> look for spork.
>>>>>>> >
>>>>>>> > Basically Matei said:
>>>>>>> >
>>>>>>> > Yup, that was it, though I believe people at Twitter picked it up again
>>>>>>> > recently. I'd suggest asking Dmitriy if you know him. I've seen interest
>>>>>>> > in this from several other groups, and if there's enough of it, maybe we
>>>>>>> > can start another open source repo to track it. The work in that repo you
>>>>>>> > pointed to was done over one week, and already had most of Pig's
>>>>>>> > operators working. (I helped out with this prototype over Twitter's hack
>>>>>>> > week.) That work also calls the Scala API directly, because it was done
>>>>>>> > before we had a Java API; it should be easier with the Java one.
>>>>>>> >
>>>>>>> >
>>>>>>> > Tom
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > On Thursday, March 6, 2014 3:11 PM, Sameer Tilak <ssti...@live.com> wrote:
>>>>>>> > Hi everyone,
>>>>>>> >
>>>>>>> > We are using Pig to build our data pipeline. I came across Spork (Pig
>>>>>>> > on Spark) at https://github.com/dvryaboy/pig and am not sure if it is
>>>>>>> > still active.
>>>>>>> >
>>>>>>> > Can someone please let me know the status of Spork or any other effort
>>>>>>> > that will let us run Pig on Spark? We can benefit significantly by using
>>>>>>> > Spark, but we would like to keep using the existing Pig scripts.
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > "...:::Aniket:::... Quetzalco@tl"
>>>>>>> >
>>>>>>> >
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
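
For the Java MR porting question in the thread above, here is a minimal sketch of the "Scala wrapper" idea for single map-reduce jobs. It assumes the core logic can be exposed as plain functions. The names MapFn, ReduceFn, runJob, wordCountMap, and wordCountReduce are hypothetical and are not part of Spark, Pig, or Hadoop. The sketch runs on local Scala collections; on a Spark pair RDD the corresponding steps would be flatMap and groupByKey. Real Hadoop Mapper/Reducer classes write their output through a Context object, so adapting them would likely need more glue than shown here.

```scala
// Sketch only: every name below is hypothetical, not a Spark or Hadoop API.
object MapReduceWrapper {
  // A map function turns one input record into zero or more (key, value) pairs.
  type MapFn[In, K, V] = In => Iterable[(K, V)]
  // A reduce function sees one key with all of its values and emits output records.
  type ReduceFn[K, V, Out] = (K, Iterable[V]) => Iterable[Out]

  // Runs map, then groups values by key (the "shuffle"), then reduce,
  // all on a local collection.
  def runJob[In, K, V, Out](input: Seq[In])(mapFn: MapFn[In, K, V])(
      reduceFn: ReduceFn[K, V, Out]): Seq[Out] = {
    val mapped = input.flatMap(mapFn)
    val grouped = mapped.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    grouped.toSeq.flatMap { case (k, vs) => reduceFn(k, vs) }
  }

  // Stand-ins for existing core logic: a word count.
  def wordCountMap(line: String): Iterable[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w, 1)).toSeq

  def wordCountReduce(word: String, counts: Iterable[Int]): Iterable[(String, Int)] =
    Seq((word, counts.sum))

  def main(args: Array[String]): Unit = {
    val lines = Seq("pig on spark", "spark on yarn")
    val counts = runJob(lines)(wordCountMap)(wordCountReduce)
    // Sort by key so the printed order is deterministic.
    counts.sortBy(_._1).foreach(println)
  }
}
```

The map and reduce functions stay untouched; only runJob would change when moving from local collections to an RDD.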
