Re: Are tachyon and akka removed from 2.1.1 please

2017-05-22 Thread Chin Wei Low
I think Akka has been removed since 2.0 (replaced by a Netty-based RPC implementation).

On 22 May 2017 10:19 pm, "Gene Pang"  wrote:

> Hi,
>
> Tachyon has been renamed to Alluxio. Here is the documentation for
> running Alluxio with Spark.
>
> Hope this helps,
> Gene
>
> On Sun, May 21, 2017 at 6:15 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
>
>> Hi all,
>> I read some papers about the source code; the papers are based on version
>> 1.2 and refer to Tachyon and Akka. When I read the 2.1 code, I cannot find
>> the code for Akka and Tachyon.
>>
>> Are Tachyon and Akka removed from 2.1.1, please?
>>
>
>


Re: Spark app write too many small parquet files

2016-11-28 Thread Chin Wei Low
Try limiting the partitions: spark.sql.shuffle.partitions

This controls the number of files generated.
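A rough sketch of both options (paths and partition counts are illustrative, not from the original app):

```scala
// Two ways to cut the number of output files:
// 1. lower the shuffle partition count before running the query, or
// 2. coalesce right before the write.
sqlContext.setConf("spark.sql.shuffle.partitions", "8")  // default is 200

val df = sqlContext.read.parquet("/path/to/input")
df.coalesce(8)                      // pick n so each file lands around 128 MB+
  .write.parquet("/path/to/output")
```

coalesce avoids a full shuffle; use repartition instead if the upstream partitions are badly skewed.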

On 28 Nov 2016 8:29 p.m., "Kevin Tran"  wrote:

> Hi Denny,
> Thank you for your inputs. I also use 128 MB, but there are still too many
> files generated by the Spark app, each only ~14 KB! That's why I'm asking
> whether there is a solution, in case someone has the same issue.
>
> Cheers,
> Kevin.
>
> On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee  wrote:
>
>> Generally, yes - you should try to have larger data sizes due to the
>> overhead of opening up files.  Typical guidance is between 64MB-1GB;
>> personally I usually stick with 128MB-512MB with the default of snappy
> >> codec compression with parquet.  A good reference is Vida Ha's
> >> presentation Data Storage Tips for Optimal Spark Performance.
>>
>>
>> On Sun, Nov 27, 2016 at 9:44 PM Kevin Tran  wrote:
>>
>>> Hi Everyone,
>>> Does anyone know the best practice for writing Parquet files from Spark?
>>>
>>> As the Spark app writes data to parquet, it shows that under that
>>> directory there are heaps of very small parquet files (such as
>>> e73f47ef-4421-4bcc-a4db-a56b110c3089.parquet). Each parquet file is
>>> only 15KB.
>>>
>>> Should it write each chunk with a bigger data size (such as 128 MB) and a
>>> proper number of files?
>>>
>>> Has anyone observed performance changes when changing the data size of
>>> each parquet file?
>>>
>>> Thanks,
>>> Kevin.
>>>
>>
>


Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-25 Thread Chin Wei Low
Hi Kazuaki,

I print a debug log right before I call collect, and use that to compare
against the job start log (it is available when debug logging is turned on).
Anyway, I tested that in Spark 2.0.1 and never saw it happen. But the query
on the cached DataFrame is still slightly slower than the one without the
cache when running on Spark 2.0.1.
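For what it's worth, the gap can also be measured programmatically. A rough sketch (assuming `sc` and a query result named `res`, as in the code quoted below, are already defined):

```scala
// Rough sketch: measure the gap between calling collect() and the job start.
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

var collectCalledAt = 0L

sc.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    // Fires when the scheduler actually starts the job, i.e. after query
    // planning and optimization have finished.
    val gapMs = System.currentTimeMillis() - collectCalledAt
    println(s"Job ${jobStart.jobId} started $gapMs ms after collect()")
  }
})

collectCalledAt = System.currentTimeMillis()
res.collect()
```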

Regards,
Low Chin Wei

On Tue, Oct 25, 2016 at 3:39 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
wrote:

> Hi Chin Wei,
> I am sorry for being late to reply.
>
> Got it. Interesting behavior. How did you measure the time between 1st and
> 2nd events?
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Chin Wei Low <lowchin...@gmail.com>
> To: Kazuaki Ishizaki/Japan/IBM@IBMJP
> Cc: user@spark.apache.org
> Date: 2016/10/10 11:33
>
> Subject: Re: Spark SQL is slower when DataFrame is cache in Memory
> --
>
>
>
> Hi Ishizaki san,
>
> Thanks for the reply.
>
> So, when I pre-cache the dataframe, the cache is being used during the job
> execution.
>
> Actually there are 3 events:
> 1. call res.collect
> 2. job started
> 3. job completed
>
> I am concerned about the longer time taken between the 1st and 2nd events.
> It seems like query planning and optimization take longer when querying the
> cached dataframe.
>
>
> Regards,
> Chin Wei
>
> On Fri, Oct 7, 2016 at 10:14 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
> wrote:
> Hi Chin Wei,
> Yes, since you force the cache to be created by executing df.count, Spark
> starts to get data from the cache for the following task:
> val res = sqlContext.sql("table1 union table2 union table3")
> res.collect()
>
> If you insert 'res.explain', you can confirm which source is used to get
> the data: the cache or parquet.
> val res = sqlContext.sql("table1 union table2 union table3")
> res.explain(true)
> res.collect()
>
> Am I misunderstanding something?
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Chin Wei Low <lowchin...@gmail.com>
> To: Kazuaki Ishizaki/Japan/IBM@IBMJP
> Cc: user@spark.apache.org
> Date: 2016/10/07 20:06
> Subject: Re: Spark SQL is slower when DataFrame is cache in Memory
>
> --
>
>
>
> Hi Ishizaki san,
>
> So there is a gap between res.collect
> and when I see this log:   spark.SparkContext: Starting job: collect at
> :26
>
> Do you mean that during this time Spark already starts to get data from the
> cache? Shouldn't it only get the data after the job has started and tasks
> are distributed?
>
> Regards,
> Chin Wei
>
>
> On Fri, Oct 7, 2016 at 3:43 PM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
> wrote:
> Hi,
> I think that the result looks correct. The current Spark spends extra time
> getting data from a cache, for two reasons. One is the complicated path to
> get the data. The other is decompression in the case of a primitive type.
> The new implementation (https://github.com/apache/spark/pull/15219) is
> ready for review. It would achieve a 1.2x performance improvement for a
> compressed column and a much larger improvement for an uncompressed column.
>
> Best Regards,
> Kazuaki Ishizaki
>
>
>
> From: Chin Wei Low <lowchin...@gmail.com>
> To: user@spark.apache.org
> Date: 2016/10/07 13:05
> Subject: Spark SQL is slower when DataFrame is cache in Memory
> --
>
>
>
>
> Hi,
>
> I am using Spark 1.6.0. I have a Spark application that creates and caches
> (in memory) DataFrames (around 50+, some on a single parquet file and some
> on a folder with a few parquet files) with the following code:
>
> val df = sqlContext.read.parquet(path)
> df.persist
> df.count
>
> I union them into 3 DataFrames and register those as temp tables.
>
> Then, run the following codes:
> val res = sqlContext.sql("table1 union table2 union table3")
> res.collect()
>
> The res.collect() is slower when I cache the DataFrames than without the
> cache, e.g. 3 seconds vs 1 second.
>
> I turned on the DEBUG log and see there is a gap from res.collect() to the
> start of the Spark job.
>
> Is the extra time taken by query planning & optimization? The gap does not
> show up when I do not cache the dataframes.
>
> Anything I am missing here?
>
> Regards,
> Chin Wei
>
>
>
>
>
>


Re: [Spark] RDDs are not persisting in memory

2016-10-10 Thread Chin Wei Low
Hi,

Your RDD is 5GB; perhaps it is too large to fit into the executors' storage
memory. You can refer to the Executors tab in the Spark UI to check the
available storage memory for each executor.
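If that is the case, a serialized storage level may help it fit. A rough sketch (the input path is illustrative):

```scala
// Illustrative sketch: a serialized level is more compact in memory, and
// spills the remainder to disk instead of silently dropping partitions.
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("/path/to/input")
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
rdd.count()  // materialize the cache, then check the Storage tab in the UI
```

Increasing spark.executor.memory (or, on 1.6+, spark.memory.fraction) also grows the storage pool.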

Regards,
Chin Wei

On Tue, Oct 11, 2016 at 6:14 AM, diplomatic Guru 
wrote:

> Hello team,
>
> Spark version: 1.6.0
>
> I'm trying to persist some data in memory for reuse. However, when I call
> rdd.cache() or rdd.persist(StorageLevel.MEMORY_ONLY()), it does not store
> the data, as I cannot see any RDD information under the Web UI (Storage
> tab).
>
> Therefore I tried rdd.persist(StorageLevel.MEMORY_AND_DISK()), for which
> it stored the data to disk only, as shown in the screenshot below:
>
> [image: Inline images 2]
>
> Do you know why the memory is not being used?
>
> Is there a cluster-level configuration that stops jobs from storing data
> in memory altogether?
>
>
> Please let me know.
>
> Thanks
>
> Guru
>
>