I saw in the documentation that the value is LZO. Is it LZO or LZ4? https://github.com/Cyan4973/lz4

Based on this benchmark, they differ quite a lot.
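For what it's worth, in the Spark 1.5.x docs the accepted values for this setting are uncompressed, snappy, gzip and lzo, i.e. LZO, which is a different codec from LZ4. A minimal sketch (assuming Spark 1.5.x and the DataFrame API; the app name and paths below are hypothetical) of switching the codec before a Parquet write:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("parquet-codec-sketch"))
  val sqlContext = new SQLContext(sc)

  // Override the default codec ("gzip" in the Spark 1.5.x line) before writing Parquet.
  sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

  // Subsequent Parquet writes pick up the codec set above.
  val events = sqlContext.read.json("/data/events/2016-01-08")   // hypothetical input
  events.write.parquet("/data/events-parquet/2016-01-08")        // hypothetical output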
On Fri, Jan 8, 2016 at 9:55 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> gzip is relatively slow. It consumes a lot of CPU.
>
> snappy is faster.
>
> LZ4 is faster than GZIP and smaller than Snappy.
>
> Cheers
>
> On Fri, Jan 8, 2016 at 7:56 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> Thank you.
>>
>> And speaking of compression, is there a big difference in performance
>> between gzip and snappy? And why is Parquet using gzip by default?
>>
>> Thanks.
>>
>> On Fri, Jan 8, 2016 at 6:39 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Cycling old bits:
>>> http://search-hadoop.com/m/q3RTtRuvrm1CGzBJ
>>>
>>> Gavin:
>>> Which release of HBase did you play with?
>>>
>>> HBase has been evolving and is getting more stable.
>>>
>>> Cheers
>>>
>>> On Fri, Jan 8, 2016 at 6:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>>
>>>> I used to maintain an HBase cluster. The experience with it was not a
>>>> happy one.
>>>>
>>>> I just tried querying each day's data first and deduping against the
>>>> smaller set, and the performance is acceptable. So I guess I will use
>>>> this method.
>>>>
>>>> Again, could anyone give advice about:
>>>>
>>>> - Automatically determine the number of reducers for joins and
>>>>   groupbys: Currently in Spark SQL, you need to control the degree of
>>>>   parallelism post-shuffle using “SET
>>>>   spark.sql.shuffle.partitions=[num_tasks];”.
>>>>
>>>> Thanks.
>>>>
>>>> Gavin
>>>>
>>>> On Fri, Jan 8, 2016 at 6:25 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> bq. in a NoSQL db such as HBase
>>>>>
>>>>> +1 :-)
>>>>>
>>>>> On Fri, Jan 8, 2016 at 6:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> One option you may want to explore is writing the event table to a
>>>>>> NoSQL db such as HBase. One inherent problem in your approach is that
>>>>>> you always need to load either the full data set or a defined number
>>>>>> of partitions to see if the event has already arrived (and there is
>>>>>> no guarantee it is foolproof, but it leads to unnecessary loading in
>>>>>> most cases).
>>>>>>
>>>>>> On Sat, Jan 9, 2016 at 12:56 PM, Gavin Yue <yue.yuany...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey,
>>>>>>> Thank you for the answer. I checked the settings you mentioned and
>>>>>>> they are all correct. I noticed that in the job there are always
>>>>>>> only 200 reducers for the shuffle read; I believe that is the SQL
>>>>>>> shuffle parallelism setting.
>>>>>>>
>>>>>>> In the doc, it mentions:
>>>>>>>
>>>>>>> - Automatically determine the number of reducers for joins and
>>>>>>>   groupbys: Currently in Spark SQL, you need to control the degree
>>>>>>>   of parallelism post-shuffle using “SET
>>>>>>>   spark.sql.shuffle.partitions=[num_tasks];”.
>>>>>>>
>>>>>>> What would be the ideal number for this setting? Is it based on the
>>>>>>> hardware of the cluster?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Gavin
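There is no single ideal value for spark.sql.shuffle.partitions; it is usually tuned per cluster and workload. A hedged sketch of setting it (Spark 1.5.x assumed; sqlContext is the one predefined in spark-shell, the executor and core counts are purely illustrative, and the 3x multiplier is a common starting point rather than a rule):

  // sqlContext: org.apache.spark.sql.SQLContext, e.g. the one created by spark-shell.

  // Illustrative only: 10 executors x 16 cores is a hypothetical cluster size.
  val totalCores = 10 * 16

  // A small multiple of the total core count is a common starting point for
  // post-shuffle parallelism; raise it if shuffle tasks spill or run very long.
  sqlContext.setConf("spark.sql.shuffle.partitions", (totalCores * 3).toString)

  // Equivalent SQL form, matching the docs quoted above:
  sqlContext.sql(s"SET spark.sql.shuffle.partitions=${totalCores * 3}")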
>>>>>>> On Fri, Jan 8, 2016 at 2:48 PM, Benyi Wang <bewang.t...@gmail.com> wrote:
>>>>>>>
>>>>>>>> - I assume your parquet files are compressed. Gzip or Snappy?
>>>>>>>> - What Spark version did you use? It seems to be at least 1.4. If
>>>>>>>>   you use spark-sql and Tungsten, you might have better performance,
>>>>>>>>   but Spark 1.5.2 gave me a wrong result when the data was about
>>>>>>>>   300~400GB, just for a simple group-by and aggregate.
>>>>>>>> - Did you use Kryo serialization?
>>>>>>>> - You should have spark.shuffle.compress=true; verify it.
>>>>>>>> - How many tasks did you use? spark.default.parallelism=?
>>>>>>>> - What about this:
>>>>>>>>    - Read the data day by day.
>>>>>>>>    - Compute a bucket id from the timestamp, e.g., the date and hour.
>>>>>>>>    - Write into different buckets (you probably need a special
>>>>>>>>      writer to write the data efficiently without shuffling it).
>>>>>>>>    - Run distinct for each bucket. Because each bucket is small,
>>>>>>>>      Spark can get it done faster than having everything in one run.
>>>>>>>> - I think using groupBy(userId, timestamp) might be better than
>>>>>>>>   distinct. I guess distinct() will compare every field.
>>>>>>>>
>>>>>>>> On Fri, Jan 8, 2016 at 2:31 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> And the most frequent operation I am going to do is find the
>>>>>>>>> UserIDs that have certain events, then retrieve all the events
>>>>>>>>> associated with those UserIDs.
>>>>>>>>>
>>>>>>>>> In this case, how should I partition to speed up the process?
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> On Fri, Jan 8, 2016 at 2:29 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> hey Ted,
>>>>>>>>>>
>>>>>>>>>> The Event table is like this: UserID, EventType, EventKey,
>>>>>>>>>> TimeStamp, MetaData. I just parse it from JSON and save it as
>>>>>>>>>> Parquet; I did not change the partitioning.
>>>>>>>>>>
>>>>>>>>>> Annoyingly, every day's incoming event data has duplicates with
>>>>>>>>>> the other days. The same event could show up in Day 1 and Day 2
>>>>>>>>>> and probably Day 3.
>>>>>>>>>>
>>>>>>>>>> I only want to keep a single Event table, but each day brings so
>>>>>>>>>> many duplicates.
>>>>>>>>>>
>>>>>>>>>> Is there a way I could just insert into Parquet and, if a
>>>>>>>>>> duplicate is found, just ignore it?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Gavin
>>>>>>>>>>
>>>>>>>>>> On Fri, Jan 8, 2016 at 2:18 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is your Parquet data source partitioned by date?
>>>>>>>>>>>
>>>>>>>>>>> Can you dedup within partitions?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 8, 2016 at 2:10 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I tried it on three days' data. The total input is only 980GB,
>>>>>>>>>>>> but the shuffle write data is about 6.2TB; then the job failed
>>>>>>>>>>>> during the shuffle read step, which should be another 6.2TB of
>>>>>>>>>>>> shuffle read.
>>>>>>>>>>>>
>>>>>>>>>>>> I think that to dedup, the shuffling cannot be avoided. Is
>>>>>>>>>>>> there anything I could do to stabilize this process?
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jan 8, 2016 at 2:04 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I get each day's Event table and want to merge them into a
>>>>>>>>>>>>> single Event table. But there are so many duplicates among
>>>>>>>>>>>>> each day's data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I use Parquet as the data source. What I am doing now is
>>>>>>>>>>>>>
>>>>>>>>>>>>> EventDay1.unionAll(EventDay2).distinct().write.parquet("a new
>>>>>>>>>>>>> parquet file").
>>>>>>>>>>>>>
>>>>>>>>>>>>> Each day's events are stored in their own Parquet file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But it failed at stage 2, which keeps losing the connection to
>>>>>>>>>>>>> one executor. I guess this is due to a memory issue.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any suggestion on how to do this efficiently?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Gavin
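Putting the suggestions above together, a rough sketch of the dedup step (Spark 1.5 DataFrame API assumed; the paths, the bucket format, and the choice of UserID/EventKey/TimeStamp as the identifying columns are assumptions that would need to match the real schema). It dedups on the key columns rather than whole rows, as suggested above, and writes the result laid out by a timestamp-derived bucket:

  import org.apache.spark.sql.functions._

  // sqlContext: org.apache.spark.sql.SQLContext, e.g. the one created by spark-shell.
  // Hypothetical inputs: one Parquet directory per arrival day.
  val days = Seq("/data/events/day1", "/data/events/day2", "/data/events/day3")
  val all  = days.map(p => sqlContext.read.parquet(p)).reduce(_ unionAll _)

  // Bucket id derived from the event timestamp (assumed to be epoch milliseconds),
  // so duplicates of the same event always fall into the same bucket.
  val bucketed = all.withColumn(
    "bucket", from_unixtime((col("TimeStamp") / 1000).cast("long"), "yyyy-MM-dd-HH"))

  // Dedup on the identifying columns only, which compares far less data per row
  // than whole-row distinct(), and let the bucket column partition the output.
  bucketed
    .dropDuplicates(Seq("UserID", "EventKey", "TimeStamp"))
    .write
    .partitionBy("bucket")
    .parquet("/data/events-deduped")   // hypothetical output path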
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
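Finally, a hedged sketch of the incremental variant Gavin mentions earlier in the thread: dedup each new day only against a bounded window of already-merged days, instead of running distinct() over the full history. Spark 1.5 is assumed, which has no anti join, so a left outer join plus a null filter stands in; the paths, window contents and key columns are hypothetical:

  import org.apache.spark.sql.functions._

  // sqlContext: org.apache.spark.sql.SQLContext, e.g. the one created by spark-shell.
  // Hypothetical guess at what uniquely identifies an event.
  val keyCols = Seq("UserID", "EventKey", "TimeStamp")

  // The new day's events, deduped within the day first.
  val newDay = sqlContext.read.parquet("/data/events/2016-01-08")
    .dropDuplicates(keyCols)

  // A bounded window of recently merged days; only the key columns are needed.
  val recentKeys = sqlContext.read.parquet("/data/events-merged/recent")
    .select(keyCols.map(c => col(c)): _*)

  // Keep only new-day rows whose key does not appear in the recent window.
  val joinCond = keyCols.map(c => newDay(c) === recentKeys(c)).reduce(_ && _)
  val onlyNew = newDay
    .join(recentKeys, joinCond, "left_outer")
    .filter(recentKeys("UserID").isNull)
    .select(newDay.columns.map(newDay(_)): _*)

  onlyNew.write.parquet("/data/events-merged/2016-01-08")   // hypothetical output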