But I don't see how that works here with Phoenix or an HBase coprocessor.
Remember, we are joining two big data sets here: one is the big file in HDFS,
the other is the set of records in HBase. The driving force comes from the
Hadoop cluster.
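
To make that concrete, here is the kind of job I have in mind (a rough sketch
in Scala; the file path, table name, column family, and key layout are all
assumptions, not a real schema): load the HBase table as an RDD via
TableInputFormat, key both sides by row key, and let the cluster do a
distributed join instead of per-record RPCs.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HdfsHBaseJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-hbase-join"))

    // Side 1: the big HDFS file, keyed by its id (assumed first CSV field).
    val fileRdd = sc.textFile("hdfs:///data/big_file.csv")
      .map(line => (line.split(",")(0), line))

    // Side 2: a full scan of the HBase table as an RDD; extract the one
    // column we care about as a String so the shuffled data stays serializable.
    val hConf = HBaseConfiguration.create()
    hConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table
    val hbaseRdd = sc.newAPIHadoopRDD(hConf, classOf[TableInputFormat],
        classOf[ImmutableBytesWritable], classOf[Result])
      .map { case (key, result) =>
        val value = Option(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value")))
        (Bytes.toString(key.get()), value.map(Bytes.toString).orNull)
      }

    // The join itself is driven by the cluster, not by per-record round trips.
    val joined = fileRdd.leftOuterJoin(hbaseRdd)
    println(joined.count())
  }
}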




On Thu, Sep 3, 2015 at 11:37 AM, Jörn Franke <jornfra...@gmail.com> wrote:

> If you use Pig or Spark you increase the complexity significantly from an
> operations management perspective. Spark should be considered from a platform
> perspective, where it makes sense. If you can do it directly with HBase/Phoenix
> or with just an HBase coprocessor, then that should be preferred. Otherwise you
> pay more for maintenance and development.
>
> On Thu, Sep 3, 2015 at 5:16 PM, Tao Lu <taolu2...@gmail.com> wrote:
>
>> Yes, Ayan, your approach will work.
>>
>> Alternatively, use Spark and write a Scala/Java function which implements
>> logic similar to your Pig UDF.
>>
>> Both approaches look similar.
>>
>> Personally, I would go with the Spark solution; it will be slightly faster,
>> and easier if you already have a Spark cluster set up on top of your Hadoop
>> cluster.
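>>
>> A sketch of the kind of function I mean (Scala; the table, column family,
>> and qualifier names are placeholders): open one HBase connection per
>> partition, then Get, compare, and Put only the rows that actually changed.
>>
>> import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
>> import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
>> import org.apache.hadoop.hbase.util.Bytes
>> import org.apache.spark.rdd.RDD
>>
>> object UpsertJob {
>>   // fileRdd holds (rowKey, newValue) pairs parsed from the HDFS file.
>>   def upsert(fileRdd: RDD[(String, String)]): Unit = {
>>     fileRdd.foreachPartition { records =>
>>       // One connection per partition, not one per record.
>>       val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
>>       val table = conn.getTable(TableName.valueOf("my_table"))
>>       val (cf, col) = (Bytes.toBytes("d"), Bytes.toBytes("value"))
>>       records.foreach { case (rowKey, newValue) =>
>>         val result = table.get(new Get(Bytes.toBytes(rowKey)))
>>         val current = Option(result.getValue(cf, col)).map(Bytes.toString)
>>         if (current != Some(newValue)) { // missing or different: write it
>>           val put = new Put(Bytes.toBytes(rowKey))
>>           put.addColumn(cf, col, Bytes.toBytes(newValue))
>>           table.put(put)
>>         }
>>       }
>>       table.close()
>>       conn.close()
>>     }
>>   }
>> }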
>>
>> Cheers,
>> Tao
>>
>>
>> On Thu, Sep 3, 2015 at 1:15 AM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Thanks for your info. I am planning to implement a Pig UDF to do record
>>> lookups. Kindly let me know if this is a good idea.
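>>>
>>> In case it helps the discussion, this is roughly the shape of UDF I have
>>> in mind (a sketch only, written in Scala to match the rest of the thread;
>>> the table and column names are made up): it keeps one HBase connection per
>>> task and does a Get per input tuple.
>>>
>>> import java.io.IOException
>>> import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
>>> import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Get}
>>> import org.apache.hadoop.hbase.util.Bytes
>>> import org.apache.pig.EvalFunc
>>> import org.apache.pig.data.Tuple
>>>
>>> class HBaseLookup extends EvalFunc[String] {
>>>   // Lazily open one connection per task and reuse it across exec() calls.
>>>   private lazy val conn: Connection =
>>>     ConnectionFactory.createConnection(HBaseConfiguration.create())
>>>   private lazy val table = conn.getTable(TableName.valueOf("my_table"))
>>>
>>>   @throws[IOException]
>>>   override def exec(input: Tuple): String = {
>>>     if (input == null || input.size() == 0) return null
>>>     val result = table.get(new Get(Bytes.toBytes(input.get(0).toString)))
>>>     if (result.isEmpty) null
>>>     else Bytes.toString(result.getValue(Bytes.toBytes("d"), Bytes.toBytes("value")))
>>>   }
>>> }
>>>
>>> I would then REGISTER the jar, DEFINE the UDF, and call it from a
>>> FOREACH ... GENERATE over the incoming records.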
>>>
>>> Best
>>> Ayan
>>>
>>> On Thu, Sep 3, 2015 at 2:55 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>>
>>>> You may check whether it makes sense to write a coprocessor that does the
>>>> upsert for you, if one does not already exist. Maybe Phoenix for HBase
>>>> supports this already.
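>>>>
>>>> For what it is worth, Phoenix does expose an UPSERT statement over plain
>>>> JDBC. A minimal sketch (the ZooKeeper host and table are assumptions; note
>>>> that UPSERT writes unconditionally, so comparing old and new values would
>>>> still need a read or a coprocessor):
>>>>
>>>> import java.sql.DriverManager
>>>>
>>>> object PhoenixUpsert {
>>>>   def main(args: Array[String]): Unit = {
>>>>     // "zk-host" stands in for your ZooKeeper quorum.
>>>>     val conn = DriverManager.getConnection("jdbc:phoenix:zk-host")
>>>>     val stmt = conn.prepareStatement(
>>>>       "UPSERT INTO MY_TABLE (ID, VAL) VALUES (?, ?)") // hypothetical table
>>>>     stmt.setString(1, "row1")
>>>>     stmt.setString(2, "v1")
>>>>     stmt.executeUpdate()
>>>>     conn.commit() // Phoenix buffers mutations until commit
>>>>     conn.close()
>>>>   }
>>>> }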
>>>>
>>>> Another alternative, if the records do not have a unique ID, is to put
>>>> them into a text index engine such as Solr or Elasticsearch, which in this
>>>> case does fast matching with relevancy scores.
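>>>>
>>>> For illustration, such a lookup against Elasticsearch boils down to a
>>>> match query; a bare-bones sketch using only the JDK (the host, index, and
>>>> field names are assumptions):
>>>>
>>>> import java.io.OutputStreamWriter
>>>> import java.net.{HttpURLConnection, URL}
>>>> import scala.io.Source
>>>>
>>>> object EsMatchLookup {
>>>>   def main(args: Array[String]): Unit = {
>>>>     val url = new URL("http://localhost:9200/records/_search") // assumed
>>>>     val http = url.openConnection().asInstanceOf[HttpURLConnection]
>>>>     http.setRequestMethod("POST")
>>>>     http.setRequestProperty("Content-Type", "application/json")
>>>>     http.setDoOutput(true)
>>>>     // A match query returns candidate records ordered by relevancy score.
>>>>     val query = """{"query": {"match": {"name": "john smyth"}}}"""
>>>>     val out = new OutputStreamWriter(http.getOutputStream, "UTF-8")
>>>>     out.write(query)
>>>>     out.close()
>>>>     println(Source.fromInputStream(http.getInputStream).mkString)
>>>>   }
>>>> }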
>>>>
>>>>
>>>> You can also use Spark and Pig there. However, I am not sure whether Spark
>>>> is suitable for these one-row lookups. The same holds for Pig.
>>>>
>>>>
>>>> On Wed, Sep 2, 2015 at 11:53 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>> Hello group
>>>>
>>>> I am trying to use Pig or Spark to achieve the following:
>>>>
>>>> 1. Write a batch process which reads from a file.
>>>> 2. Look up HBase to see if the record exists. If so, compare the incoming
>>>> values with HBase and update the fields which do not match; otherwise,
>>>> create a new record.
>>>>
>>>> My questions:
>>>> 1. Is this a good use case for Pig or Spark?
>>>> 2. Is there any way to read HBase for each incoming record in Pig without
>>>> writing MapReduce code?
>>>> 3. In the case of Spark, I think we have to connect to HBase for every
>>>> record. Is there any other way?
>>>> 4. What is the best HBase connector which provides this functionality?
>>>>
>>>> Best
>>>>
>>>> Ayan
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>>
>> --
>> ------------------------------------------------
>> Thanks!
>> Tao
>>
>


-- 
------------------------------------------------
Thanks!
Tao
