Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you, that works. Sincerely yours, Raymond On Tue, Jun 19, 2018 at 4:36 PM, Nicolas Paris wrote: > Hi Raymond > > Spark works well on a single machine too, since it benefits from multiple > cores. > The csv parser is based on

Re: Best way to process this dataset

2018-06-19 Thread Nicolas Paris
Hi Raymond, Spark works well on a single machine too, since it benefits from multiple cores. The csv parser is based on univocity, and you might use the "spark.read.csv" syntax instead of the RDD API. From my experience, this will be better than any other csv parser. 2018-06-19 16:43 GMT+02:00

Re: Best way to process this dataset

2018-06-19 Thread Raymond Xie
Thank you Matteo, Aakash and Georg: I am attempting to get some stats first. The data looks like: 1,4152983,2355072,pv,1511871096 I would like to find the count for each key of (UserID, Behavior Type) val bh_count =
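The count per (UserID, Behavior Type) key described above can also be computed without Spark by streaming the file one row at a time. The following is only a minimal sketch using the Python standard library, not the code from the original message: it assumes the column order user ID, item ID, category ID, behavior type, timestamp (inferred from the sample row), and the function name `count_user_behavior` is hypothetical.

```python
import csv
import io
from collections import Counter


def count_user_behavior(lines):
    """Count rows per (user_id, behavior) key from an iterable of CSV lines."""
    counts = Counter()
    for row in csv.reader(lines):
        if len(row) != 5:
            continue  # skip blank or malformed rows
        # Assumed column order: user, item, category, behavior, timestamp.
        user_id, _item_id, _category_id, behavior, _timestamp = row
        counts[(int(user_id), behavior)] += 1
    return counts


# Tiny in-memory sample: the first row is the one quoted in the message,
# the remaining rows are made up for illustration.
sample = io.StringIO(
    "1,4152983,2355072,pv,1511871096\n"
    "1,4152983,2355072,pv,1511871100\n"
    "1,4152983,2355072,buy,1511871200\n"
    "2,4152983,2355072,pv,1511871300\n"
)
bh_count = count_user_behavior(sample)
print(bh_count[(1, "pv")])
```

For the real file, an open file object (opened with `newline=""`) can be passed in place of the in-memory sample, so only one row is held in memory at a time. The counter itself still grows with the number of distinct keys, up to roughly four million here (987,994 users times four behavior types), so whether it fits in 2GB of RAM would need to be checked.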

Re: Best way to process this dataset

2018-06-19 Thread Matteo Cossu
Single machine? Any other framework will perform better than Spark. On Tue, 19 Jun 2018 at 09:40, Aakash Basu wrote: > Georg, just asking, can Pandas handle such a big dataset? What if that data is > then passed into any of the sklearn modules? > > On Tue, Jun 19, 2018 at 10:35 AM, Georg

Re: Best way to process this dataset

2018-06-19 Thread Aakash Basu
Georg, just asking, can Pandas handle such a big dataset? What if that data is then passed into any of the sklearn modules? On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler wrote: > use pandas or dask > > If you do want to use Spark, store the dataset as parquet / orc. And then > continue to

Re: Best way to process this dataset

2018-06-18 Thread Georg Heiler
Use pandas or dask. If you do want to use Spark, store the dataset as parquet / orc, and then continue to perform analytical queries on that dataset. Raymond Xie wrote on Tue, 19 June 2018 at 04:29: > I have a 3.6GB csv dataset (4 columns, 100,150,807 rows), my environment > is 20GB ssd

Best way to process this dataset

2018-06-18 Thread Raymond Xie
I have a 3.6GB csv dataset (4 columns, 100,150,807 rows); my environment is a 20GB SSD hard disk and 2GB RAM. The dataset comes with:
User ID: 987,994
Item ID: 4,162,024
Category ID: 9,439
Behavior type ('pv', 'buy', 'cart', 'fav')
Unix Timestamp: spans November 25 to December 03, 2017
I