Hi Raymond

Spark works well on a single machine too, since it benefits from multiple
cores.
The CSV parser is based on univocity, and you might use the
spark.read.csv API instead of the RDD API.

From my experience, this will perform better than any other CSV parser.
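
A minimal sketch of that approach (assuming a SparkSession named spark; the
column names are my guesses from the sample row quoted below, not from the
file itself):

val df = spark.read
  .option("inferSchema", "true")   // let Spark infer the column types
  .csv("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
  .toDF("userId", "itemId", "categoryId", "behaviorType", "timestamp")

// the (UserID, Behavior Type) count from below, in the DataFrame API
val bhCount = df.groupBy("userId", "behaviorType").count()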

2018-06-19 16:43 GMT+02:00 Raymond Xie <xie3208...@gmail.com>:

> Thank you Matteo, Aakash, and Georg:
>
> I am attempting to get some stats first; the data looks like:
>
> 1,4152983,2355072,pv,1511871096
>
> I'd like to find the count for each key of (UserID, Behavior Type):
>
> val bh_count = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
>   .map(_.split(","))
>   .map(x => ((x(0).toInt, x(3)), 1))
>   .groupByKey()
>
> This shows me:
> scala> val first = bh_count.first
> [Stage 1:>                                                          (0 + 1) / 1]
> 2018-06-19 10:41:19 WARN  Executor:66 - Managed memory leak detected; size = 15848112 bytes, TID = 110
> first: ((Int, String), Iterable[Int]) = ((878310,pv),CompactBuffer(1, 1,
> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
> 1, 1))
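>
> (Side note on the code above: groupByKey materializes every 1 in a
> CompactBuffer, which is what the output shows. A sketch of the same count
> with reduceByKey, which folds the 1s into a sum as it goes:)
>
> val bh_count2 = sc.textFile("C:\\RXIE\\Learning\\Data\\Alibaba\\UserBehavior\\UserBehavior.csv")
>   .map(_.split(","))
>   .map(x => ((x(0).toInt, x(3)), 1))
>   .reduceByKey(_ + _)  // sums per key instead of buffering the 1s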
>
>
> *Note: this environment is Windows 7 with 32GB RAM. (I am running it on
> Windows first, where I have more RAM, rather than Ubuntu, so the
> environment differs from what I said in the original email.)*
> *The dataset is 3.6GB.*
>
> *Thank you very much.*
> *------------------------------------------------*
> *Sincerely yours,*
>
>
> *Raymond*
>
> On Tue, Jun 19, 2018 at 4:04 AM, Matteo Cossu <elco...@gmail.com> wrote:
>
>> Single machine? Any other framework will perform better than Spark
>>
>> On Tue, 19 Jun 2018 at 09:40, Aakash Basu <aakash.spark....@gmail.com>
>> wrote:
>>
>>> Georg, just asking, can Pandas handle such a big dataset, especially if
>>> the data is then passed into any of the sklearn modules?
>>>
>>> On Tue, Jun 19, 2018 at 10:35 AM, Georg Heiler <
>>> georg.kf.hei...@gmail.com> wrote:
>>>
>>>> use pandas or dask
>>>>
>>>> If you do want to use Spark, store the dataset as Parquet / ORC, and
>>>> then continue to perform analytical queries on that dataset.
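>>>>
>>>> A minimal sketch of that conversion (the paths here are placeholders,
>>>> not from this thread):
>>>>
>>>> spark.read.csv("UserBehavior.csv").write.parquet("UserBehavior.parquet")
>>>> // later analytical queries read the columnar files instead of
>>>> // re-parsing the CSV every time:
>>>> val df = spark.read.parquet("UserBehavior.parquet")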
>>>>
>>>> Raymond Xie <xie3208...@gmail.com> wrote on Tue., 19 June 2018 at
>>>> 04:29:
>>>>
>>>>> I have a 3.6GB CSV dataset (5 columns, 100,150,807 rows); my
>>>>> environment has a 20GB SSD hard disk and 2GB of RAM.
>>>>>
>>>>> The dataset contains:
>>>>> User ID: 987,994 distinct values
>>>>> Item ID: 4,162,024 distinct values
>>>>> Category ID: 9,439 distinct values
>>>>> Behavior type: one of 'pv', 'buy', 'cart', 'fav'
>>>>> Unix Timestamp: spanning November 25 to December 3, 2017
>>>>>
>>>>> I would like to hear any suggestions from you on how I should process
>>>>> the dataset with my current environment.
>>>>>
>>>>> Thank you.
>>>>>
>>>>> *------------------------------------------------*
>>>>> *Sincerely yours,*
>>>>>
>>>>>
>>>>> *Raymond*
>>>>>
>>>>
>>>
>
