Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Ted Yu
See the first half of this wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LZO

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Gavin Yue
So I tried to set the Parquet compression codec to lzo, but Hadoop does not have the lzo natives, while lz4 is included. But I could not set the codec to lz4; it only accepts lzo. Any solution here? Thanks, Gavin
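For reference, a rough sketch of how the Parquet codec is usually chosen (assuming the Spark 1.5/1.6 SQLContext API and a spark-shell-style sqlContext; to the best of my knowledge the accepted values at that point were uncompressed, snappy, gzip and lzo, which would explain why lz4 is rejected):

    // snappy ships with the Hadoop/Spark natives on most distributions,
    // so it avoids the separate native install that lzo requires
    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    // any subsequent Parquet write picks up the codec (df is a placeholder DataFrame)
    df.write.parquet("events_snappy")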

Re: How to merge two large table and remove duplicates?

2016-01-09 Thread Gavin Yue
I saw in the documentation that the value is LZO. Is it LZO or LZ4? https://github.com/Cyan4973/lz4 Based on this benchmark, they differ quite a lot.

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Benyi Wang
Just try 1000, or even 2000, to see if it works. If you see something like "Lost Executor", you'd better stop your job, otherwise you are wasting time. Usually the container of the lost executor is killed by the NodeManager because there is not enough memory. You can check the NodeManager's logs to confirm.

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
gzip is relatively slow; it consumes a lot of CPU. Snappy is faster. LZ4 is faster than gzip and smaller than Snappy. Cheers

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Thank you. And speaking of compression, is there a big difference in performance between gzip and snappy? And why does Parquet use gzip by default? Thanks.

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Cycling old bits: http://search-hadoop.com/m/q3RTtRuvrm1CGzBJ Gavin: Which release of HBase did you play with? HBase has been evolving and is getting more stable. Cheers

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
I used to maintain an HBase cluster. The experience with it was not a happy one. I just tried querying each day's data first and deduping against the smaller set, and the performance is acceptable. So I guess I will use this method (sketched below). Again, could anyone give advice about: - Automatically determining the number of reducers for joins and groupbys?
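A hedged sketch of that "dedup each day first, then against the merged table" idea, assuming Spark 1.6 DataFrames, hypothetical DataFrame names (eventToday, mergedEvents) and the column names described later in the thread:

    import org.apache.spark.sql.functions._

    val keyCols = Seq("UserID", "EventType", "EventKey", "TimeStamp")

    // 1) dedup within the new day only; the shuffle is limited to one day of data
    val todayDeduped = eventToday.dropDuplicates(keyCols)

    // 2) keep only events whose key is not already in the merged table
    //    (Spark 1.6 has no anti join, so emulate it with a marker + left outer join)
    val existingKeys = mergedEvents
      .select(keyCols.map(col): _*)
      .distinct()
      .withColumn("already_there", lit(true))

    val newEvents = todayDeduped
      .join(existingKeys, keyCols, "left_outer")
      .filter(col("already_there").isNull)
      .drop("already_there")

    // 3) append only the genuinely new rows
    newEvents.write.mode("append").parquet("merged_events")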

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
bq. in a NoSQL db such as HBase +1 :-)

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread ayan guha
One option you may want to explore is writing the event table to a NoSQL db such as HBase. One inherent problem in your approach is that you always need to load either the full data set or a defined number of partitions to see if the event has already arrived (and there is no guarantee it is foolproof, but it leads to unnecessary work).
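A rough sketch of the idea, assuming the HBase 1.x client API and a hypothetical row key built from the event's identifying fields, so that writing the same event twice is just an idempotent overwrite:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // hypothetical event shape; dedup falls out of the row key design
    case class Event(userId: String, eventType: String, eventKey: String, ts: Long, meta: String)

    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("events"))

    def writeEvent(e: Event): Unit = {
      // same event => same row key => re-ingesting a duplicate overwrites the same cell
      val rowKey = s"${e.userId}|${e.eventType}|${e.eventKey}|${e.ts}"
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("meta"), Bytes.toBytes(e.meta))
      table.put(put)
    }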

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey, thank you for the answer. I checked the settings you mentioned; they are all correct. I noticed that in the job there are always only 200 reducers for the shuffle read; I believe that is the SQL shuffle parallelism setting. In the doc, it mentions: - Automatically determine the number of reducers for joins and groupbys.
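For reference, those 200 reducers are the default of spark.sql.shuffle.partitions; a minimal sketch of overriding it (assuming a spark-shell-style sqlContext; 2000 is just a value to try, per the advice elsewhere in this thread):

    sqlContext.setConf("spark.sql.shuffle.partitions", "2000")

    // equivalently, at submit time:
    // spark-submit --conf spark.sql.shuffle.partitions=2000 ...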

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Benyi: bq. spark 1.5.2 gave me a wrong result when the data was about 300~400GB, just for a simple group-by and aggregate Can you reproduce the above using Spark 1.6.0? Thanks

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Benyi Wang
- I assume your parquet files are compressed. Gzip or Snappy? - What Spark version did you use? It seems at least 1.4. If you use spark-sql and Tungsten, you might get better performance, but Spark 1.5.2 gave me a wrong result when the data was about 300~400GB, just for a simple group-by and aggregate.

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
And the most frequent operation I am going to do is find the UserIDs who have certain events, then retrieve all the events associated with those UserIDs. In this case, how should I partition to speed up the process? Thanks.
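One hedged sketch of that access pattern with DataFrames (the events DataFrame name and the event type are hypothetical; broadcasting the matched user list keeps the second step from becoming another full shuffle, assuming the list is small):

    import org.apache.spark.sql.functions._

    // users who have at least one event of the type of interest
    val matchingUsers = events
      .filter(col("EventType") === "purchase")   // "purchase" is only an example
      .select("UserID")
      .distinct()

    // all events belonging to those users
    val theirEvents = events.join(broadcast(matchingUsers), "UserID")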

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey Ted, the Event table is like this: UserID, EventType, EventKey, TimeStamp, MetaData. I just parse it from JSON and save it as Parquet; I did not change the partitioning. Annoyingly, every day's incoming Event data has duplicates among each other. The same event could show up in Day1 and Day2 and probably later days as well.
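A small sketch of how the duplicate rate across two days could be measured, assuming hypothetical day1/day2 DataFrames loaded from each day's Parquet and the column names above:

    import org.apache.spark.sql.functions._

    val twoDays = day1.unionAll(day2)
    val dupKeys = twoDays
      .groupBy("UserID", "EventType", "EventKey", "TimeStamp")
      .count()
      .filter(col("count") > 1)

    println(s"event keys duplicated across the two days: ${dupKeys.count()}")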

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Ted Yu
Is your Parquet data source partitioned by date? Can you dedup within partitions? Cheers

Re: How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
I tried it on three days' data. The total input is only 980GB, but the shuffle write data is about 6.2TB, and then the job failed during the shuffle read step, which should be another 6.2TB of shuffle read. I think that to dedup, the shuffling cannot be avoided. Is there anything I could do to stabilize this process?

How to merge two large table and remove duplicates?

2016-01-08 Thread Gavin Yue
Hey, I get each day's Event table and want to merge them into a single Event table. But there are so many duplicates among each day's data. I use Parquet as the data source. What I am doing now is EventDay1.unionAll(EventDay2).distinct().write.parquet("a new parquet file"). Each day's Events are stored in a separate Parquet file.
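A hedged sketch of one common way to write this merge (assuming Spark 1.5/1.6, that "same event" is defined by the identifying columns rather than the whole row, and that TimeStamp is epoch milliseconds; note that distinct() compares entire rows, so two copies that differ only in MetaData would both survive it):

    import org.apache.spark.sql.functions._

    val merged = eventDay1.unionAll(eventDay2)
      .dropDuplicates(Seq("UserID", "EventType", "EventKey", "TimeStamp"))

    // writing partitioned by day keeps any later dedup within partitions
    merged
      .withColumn("day", to_date(from_unixtime(col("TimeStamp") / 1000)))
      .write
      .partitionBy("day")
      .parquet("merged_events")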

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Gavin Yue
Each folder should have no dups; dups only exist among different folders. The logic inside is to take only the longest string value for each key. The current problem is exceeding the largest frame size when trying to write to HDFS: the frame is about 500m while the setting is 80m.
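If the limit being hit is Spark's Akka frame size (an assumption; in Spark 1.x the property is spark.akka.frameSize, in MB), a hedged sketch of raising it:

    import org.apache.spark.{SparkConf, SparkContext}

    // cap on messages between driver and executors; only raise it if the
    // error message actually names this limit
    val conf = new SparkConf()
      .setAppName("union-dedup")
      .set("spark.akka.frameSize", "512")
    val sc = new SparkContext(conf)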

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread Josh Rosen
> ... if (a.length > b.length) {a} else {b} }) > nodups.saveAsTextFile("/nodups") > Anything I could do to make this process faster? Right now my process dies when outputting the data to HDFS. > Thank you!

Re: What is most efficient to do a large union and remove duplicates?

2015-06-14 Thread ayan guha
Can you do the dedupe process locally for each file first and then globally? Also, I did not fully get the logic of the part inside reduceByKey. Can you kindly explain?

What is most efficient to do a large union and remove duplicates?

2015-06-13 Thread Gavin Yue
I have 10 folders, each with 6000 files. Each folder is roughly 500GB, so 5TB of data in total. The data is formatted as key \t value. After the union, I want to remove the duplicates among keys, so each key should be unique and have only one value. Here is what I am doing: folders = Array("folder1", ...
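A self-contained sketch of the approach described in this thread, reconstructed from the fragments quoted in the replies above (the HDFS paths, app name and parsing are illustrative; the "keep the longest value per key" rule is as stated in those replies):

    import org.apache.spark.{SparkConf, SparkContext}

    object DedupUnion {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dedup-union"))

        val folders = Array("folder1", "folder2")   // ... through folder10

        // each file holds lines of "key<TAB>value"; parse into (key, value) pairs
        val all = folders
          .map(dir => sc.textFile(s"hdfs:///data/$dir/*").map { line =>
            val Array(k, v) = line.split("\t", 2)
            (k, v)
          })
          .reduce(_ union _)

        // keep only the longest value per key
        val nodups = all.reduceByKey((a, b) => if (a.length > b.length) a else b)

        nodups.map { case (k, v) => s"$k\t$v" }.saveAsTextFile("/nodups")
        sc.stop()
      }
    }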

remove duplicates

2014-03-24 Thread Adrian Mocanu
I have a DStream like this: ..RDD[a,b], RDD[b,c].. Is there a way to remove duplicates across the entire DStream? I.e., I would like the output to be (by removing one of the b's): ..RDD[a], RDD[b,c].. or ..RDD[a,b], RDD[c].. Thanks -Adrian
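Deduplicating across a whole DStream needs state that outlives a single batch; a hedged sketch using updateStateByKey (each element is emitted only in the batch where it is first seen; the state grows without bound, so this only suits bounded key spaces, and the socket source is just a placeholder):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object StreamDedup {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("stream-dedup"), Seconds(10))
        ssc.checkpoint("/tmp/dedup-checkpoint")   // required for stateful operations

        val lines = ssc.socketTextStream("localhost", 9999)

        // state per element: true = first seen in this batch, false = seen in an earlier batch
        val firstSeen = lines.map(x => (x, ()))
          .updateStateByKey[Boolean]((values: Seq[Unit], prev: Option[Boolean]) =>
            if (prev.isEmpty) Some(true) else Some(false))

        // emit each element only once, in the batch where it first appears
        firstSeen.filter(_._2).map(_._1).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }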