Re: RDD blocks on Spark Driver

2017-02-28 Thread Prithish
This is the command I am running:

spark-submit --deploy-mode cluster --master yarn --class com.myorg.myApp
s3://my-bucket/myapp-0.1.jar

On Wed, Mar 1, 2017 at 12:22 AM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Prithish,
>
> It would be helpful for you to share the spark-submit command you are
> running.
>
> ~ Jonathan
>
> On Sun, Feb 26, 2017 at 8:29 AM Prithish <prith...@gmail.com> wrote:
>
>> Thanks for the responses. I am running this on Amazon EMR, which uses the
>> YARN cluster manager.
>>
>> On Sat, Feb 25, 2017 at 4:45 PM, liangyhg...@gmail.com <
>> liangyhg...@gmail.com> wrote:
>>
>> Hi,
>> I think you are using the local mode of Spark. There
>> are mainly four modes: local, standalone, YARN,
>> and Mesos. Also, "blocks" is an HDFS concept, while "partitions"
>> is a Spark concept.
>>
>> liangyihuai
>>
>> ---Original---
>> *From:* "Jacek Laskowski" <ja...@japila.pl>
>> *Date:* 2017/2/25 02:45:20
>> *To:* "prithish"<prith...@gmail.com>;
>> *Cc:* "user"<user@spark.apache.org>;
>> *Subject:* Re: RDD blocks on Spark Driver
>>
>> Hi,
>>
>> I guess you're using local mode, which has only one executor, called the
>> driver. Is my guess correct?
>>
>> Jacek
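
A minimal spark-shell sketch of the local-mode case Jacek describes (nothing
here is from the thread itself; the numbers are arbitrary):

  // started with: spark-shell --master local[*]
  // In local mode the driver JVM hosts the only BlockManager, so any cached
  // blocks are listed under "driver" in the Executors and Storage tabs.
  val rdd = sc.parallelize(1 to 1000000).cache()
  rdd.count()   // materializes the cache; then check the UI at localhost:4040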
>>
>> On 23 Feb 2017 2:03 a.m., <prith...@gmail.com> wrote:
>>
>> Hello,
>>
>> Had a question. When I look at the executors tab in Spark UI, I notice
>> that some RDD blocks are assigned to the driver as well. Can someone please
>> tell me why?
>>
>> Thanks for the help.
>>
>>
>>


Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Prithish
Thanks for your response, Jonathan. Yes, this works. I also added another
way of achieving this to the Stack Overflow post. Thanks for the help.

On Tue, Feb 28, 2017 at 11:58 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Prithish,
>
> I saw you posted this on SO, so I responded there just now. See
> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161
>
> In short, an hdfs:// path can't be used to configure log4j because log4j
> knows nothing about hdfs. Instead, since you are using EMR, you should use
> the Configuration API when creating your cluster to configure the
> spark-log4j configuration classification. See
> http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
> for more info.
>
> ~ Jonathan
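
For reference, the spark-log4j classification Jonathan mentions is supplied as
JSON through the EMR Configuration API (for example via the --configurations
option of aws emr create-cluster, or in the console). A minimal sketch, with
placeholder property values rather than anything from this thread:

  [
    {
      "Classification": "spark-log4j",
      "Properties": {
        "log4j.rootCategory": "WARN, console",
        "log4j.logger.com.myorg": "DEBUG"
      }
    }
  ]

Each key/value pair ends up in the log4j.properties that EMR generates for
Spark, so no hdfs:// or file: path is involved at all.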
>
> On Sun, Feb 26, 2017 at 8:17 PM Prithish <prith...@gmail.com> wrote:
>
>> Steve, I tried that, but didn't work. Any other ideas?
>>
>> On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>> Try giving a resource of a file in the JAR, e.g. add a file
>> "log4j-debugging.properties" into the jar, and give a config option of
>> -Dlog4j.configuration=/log4j-debugging.properties (maybe also try
>> without the "/").
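
A hedged sketch of what that looks like as a full spark-submit, reusing the jar
and class name from Prithish's spark-submit example elsewhere in this archive
(the executor option is an extra, only needed if executor-side logging should
change as well):

  spark-submit --deploy-mode cluster --master yarn \
    --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-debugging.properties" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-debugging.properties" \
    --class com.myorg.myApp s3://my-bucket/myapp-0.1.jar

When the value of -Dlog4j.configuration is not a URL, log4j falls back to
looking it up as a classpath resource, which is why a file bundled at the root
of the application jar can work without any hdfs:// or file: prefix.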
>>
>>
>> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>>
>> Hoping someone can answer this.
>>
>> I am unable to override and use a custom log4j.properties on Amazon EMR.
>> I am running Spark on EMR (YARN) and have tried all the combinations below
>> in spark-submit to try to use the custom log4j.
>>
>> In Client mode
>> --driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>>
>> In Cluster mode
>> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>>
>> I have also tried picking it up from the local filesystem using file:
>> instead of hdfs. None of these seem to work. However, I can get this working
>> when running on my local YARN setup.
>>
>> Any ideas?
>>
>> I have also posted on Stackoverflow (link below)
>> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
>>
>>
>>
>>


Re: Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Steve, I tried that, but didn't work. Any other ideas?

On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com>
wrote:

> Try giving a resource of a file in the JAR, e.g. add a file
> "log4j-debugging.properties" into the jar, and give a config option of
> -Dlog4j.configuration=/log4j-debugging.properties (maybe also try
> without the "/").
>
>
> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>
> Hoping someone can answer this.
>
> I am unable to override and use a custom log4j.properties on Amazon EMR. I
> am running Spark on EMR (YARN) and have tried all the combinations below in
> spark-submit to try to use the custom log4j.
>
> In Client mode
> --driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>
> In Cluster mode
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>
> I have also tried picking it up from the local filesystem using file:
> instead of hdfs. None of these seem to work. However, I can get this working
> when running on my local YARN setup.
>
> Any ideas?
>
> I have also posted on Stackoverflow (link below)
> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
>
>
>


Custom log4j.properties on AWS EMR

2017-02-26 Thread Prithish
Hoping someone can answer this.

I am unable to override and use a custom log4j.properties on Amazon EMR. I
am running Spark on EMR (YARN) and have tried all the combinations below in
spark-submit to try to use the custom log4j.

In Client mode
--driver-java-options
"-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

In Cluster mode
--conf
"spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

I have also tried picking it up from the local filesystem using file: instead
of hdfs. None of these seem to work. However, I can get this working when
running on my local YARN setup.

Any ideas?

I have also posted on Stackoverflow (link below)
http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr


Re: RDD blocks on Spark Driver

2017-02-26 Thread Prithish
Thanks for the responses. I am running this on Amazon EMR, which uses the
YARN cluster manager.

On Sat, Feb 25, 2017 at 4:45 PM, liangyhg...@gmail.com <
liangyhg...@gmail.com> wrote:

> Hi,
> I think you are using the local mode of Spark. There
> are mainly four modes: local, standalone, YARN,
> and Mesos. Also, "blocks" is an HDFS concept, while "partitions"
> is a Spark concept.
>
> liangyihuai
>
> ---Original---
> *From:* "Jacek Laskowski" <ja...@japila.pl>
> *Date:* 2017/2/25 02:45:20
> *To:* "prithish"<prith...@gmail.com>;
> *Cc:* "user"<user@spark.apache.org>;
> *Subject:* Re: RDD blocks on Spark Driver
>
> Hi,
>
> I guess you're using local mode, which has only one executor, called the
> driver. Is my guess correct?
>
> Jacek
>
> On 23 Feb 2017 2:03 a.m., <prith...@gmail.com> wrote:
>
>> Hello,
>>
>> Had a question. When I look at the executors tab in Spark UI, I notice
>> that some RDD blocks are assigned to the driver as well. Can someone please
>> tell me why?
>>
>> Thanks for the help.
>>
>


RDD blocks on Spark Driver

2017-02-22 Thread prithish
Hello,

Had a question. When I look at the executors tab in Spark UI, I notice that
some RDD blocks are assigned to the driver as well. Can someone please tell me
why?

Thanks for the help.


Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
It's something like the schema shown below (with several additional
levels/sublevels)

root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)

On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> Hi,
>
> What's the schema as interpreted by Spark?
> The compression logic of Spark caching depends on the column types.
>
> // maropu
>
>
> On Wed, Nov 16, 2016 at 5:26 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I did some more tests and I am seeing that when I have a flatter
>> structure for my AVRO, the cache memory use is close to the CSV. But when
>> I use a few levels of nesting, the cache memory usage blows up. This is
>> really critical for planning the cluster we will be using. To avoid using a
>> larger cluster, it looks like we will have to consider keeping the structure
>> as flat as possible.
>>
>> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com>
>> wrote:
>>
>>> (Adding user@spark back to the discussion)
>>>
>>>
>>>
>>> Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of
>>> scope for compression. On the other hand avro and parquet are already
>>> compressed and just store extra schema info, afaik. Avro and parquet are
>>> both going to make your data smaller, parquet through compressed columnar
>>> storage, and avro through its binary data format.
>>>
>>>
>>>
>>> Next, talking about the 62kb becoming 1224kb. I actually do not see such
>>> a massive blow up. The avro you shared is 28kb on my system and becomes
>>> 53.7kb when cached in memory deserialized and 52.9kb when cached in memory
>>> serialized. Exact same numbers with parquet as well. This is expected
>>> behavior, if I am not wrong.
>>>
>>>
>>>
>>> In fact, now that I think about it, even larger blow ups might be valid,
>>> since your data must have been deserialized from the compressed avro
>>> format, making it bigger. The order of magnitude of difference in size
>>> would depend on the type of data you have and how well it was compressible.
>>>
>>>
>>>
>>> The purpose of these formats is to store data to persistent storage in a
>>> way that's faster to read from, not to reduce cache-memory usage.
>>>
>>>
>>>
>>> Maybe others here have more info to share.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Shreya
>>>
>>>
>>>
>>> Sent from my Windows 10 phone
>>>
>>>
>>>
>>> *From: *Prithish <prith...@gmail.com>
>>> *Sent: *Tuesday, November 15, 2016 11:04 PM
>>> *To: *Shreya Agarwal <shrey...@microsoft.com>
>>> *Subject: *Re: AVRO File size when caching in-memory
>>>
>>>
>>> I did another test and noting my observations here. These were done with
>>> the same data in avro and csv formats.
>>>
>>> In AVRO, the file size on disk was 62kb and after caching, the in-memory
>>> size is 1224kb
>>> In CSV, the file size on disk was 690kb and after caching, the in-memory
>>> size is 290kb
>>>
>>> I'm guessing that Spark caching is not able to compress when the
>>> source is Avro. Not sure if this is just a premature conclusion. Waiting to
>>> hear your observation.
>>>
>>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>>>
>>>> Thanks for your response.
>>>>
>>>> I have attached the code (that I ran using the Spark-shell) as well as
>>>> a sample avro file.

Re: AVRO File size when caching in-memory

2016-11-16 Thread Prithish
Thanks for your response.

I did some more tests and I am seeing that when I have a flatter structure
for my AVRO, the cache memory use is close to the CSV. But when I use a few
levels of nesting, the cache memory usage blows up. This is really critical
for planning the cluster we will be using. To avoid using a larger cluster, it
looks like we will have to consider keeping the structure as flat as possible.
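
A hedged sketch of what "keeping the structure flat" can look like at read time
if changing the Avro schema itself is not an option (column names are borrowed
from the schema posted elsewhere in this conversation; df is assumed to be the
loaded Avro DataFrame):

  import spark.implicits._   // already in scope in spark-shell

  // Project nested struct fields up to the top level before caching, so the
  // in-memory columnar cache stores flat primitive columns rather than structs.
  val flat = df.select(
    $"sentAt", $"ip",
    $"story.myapp.id".as("myapp_id"),
    $"story.loc.city".as("city"),
    $"story.loc.country".as("country"))
  flat.createOrReplaceTempView("events_flat")
  spark.catalog.cacheTable("events_flat")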

On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com>
wrote:

> (Adding user@spark back to the discussion)
>
>
>
> Well, the CSV vs AVRO might be simpler to explain. CSV has a lot of scope
> for compression. On the other hand avro and parquet are already compressed
> and just store extra schema info, afaik. Avro and parquet are both going to
> make your data smaller, parquet through compressed columnar storage, and
> avro through its binary data format.
>
>
>
> Next, talking about the 62kb becoming 1224kb. I actually do not see such a
> massive blow up. The avro you shared is 28kb on my system and becomes
> 53.7kb when cached in memory deserialized and 52.9kb when cached in memory
> serialized. Exact same numbers with parquet as well. This is expected
> behavior, if I am not wrong.
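
For reference, the two cache variants mentioned above correspond to these
storage levels (a minimal sketch; df is assumed to be the DataFrame read from
the shared Avro file, and only one of the two persist calls would be used):

  import org.apache.spark.storage.StorageLevel

  df.persist(StorageLevel.MEMORY_ONLY)          // the "in memory deserialized" figure
  // df.persist(StorageLevel.MEMORY_ONLY_SER)   // the "in memory serialized" figure
  df.count()                                    // an action is needed to materialize the cache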
>
>
>
> In fact, now that I think about it, even larger blow ups might be valid,
> since your data must have been deserialized from the compressed avro
> format, making it bigger. The order of magnitude of difference in size
> would depend on the type of data you have and how well it was compressible.
>
>
>
> The purpose of these formats is to store data to persistent storage in a
> way that's faster to read from, not to reduce cache-memory usage.
>
>
>
> Maybe others here have more info to share.
>
>
>
> Regards,
>
> Shreya
>
>
>
> Sent from my Windows 10 phone
>
>
>
> *From: *Prithish <prith...@gmail.com>
> *Sent: *Tuesday, November 15, 2016 11:04 PM
> *To: *Shreya Agarwal <shrey...@microsoft.com>
> *Subject: *Re: AVRO File size when caching in-memory
>
>
> I did another test and noting my observations here. These were done with
> the same data in avro and csv formats.
>
> In AVRO, the file size on disk was 62kb and after caching, the in-memory
> size is 1224kb
> In CSV, the file size on disk was 690kb and after caching, the in-memory
> size is 290kb
>
> I'm guessing that Spark caching is not able to compress when the
> source is Avro. Not sure if this is just a premature conclusion. Waiting to
> hear your observation.
>
> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I have attached the code (that I ran using the Spark-shell) as well as a
>> sample avro file. After you run this code, the data is cached in memory and
>> you can go to the "storage" tab on the Spark-ui (localhost:4040) and see
>> the size it uses. In this example the size is small, but in my actual
>> scenario, the source file size is 30GB and the in-memory size comes to
>> around 800GB. I am trying to understand if this is expected when using avro
>> or not.
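
The attachment is not preserved in the archive; a minimal sketch of an
equivalent spark-shell session (Spark 2.0.x with the Databricks Avro package,
file path assumed) would be something like:

  // launched with: spark-shell --packages com.databricks:spark-avro_2.11:3.0.1
  val df = spark.read.format("com.databricks.spark.avro").load("/path/to/sample.avro")
  df.createOrReplaceTempView("sample")
  spark.catalog.cacheTable("sample")
  spark.table("sample").count()   // materialize the cache, then check the Storage tab at localhost:4040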
>>
>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com>
>> wrote:
>>
>>> I haven’t used Avro ever. But if you can send over a quick sample code,
>>> I can run and see if I repro it and maybe debug.
>>>
>>>
>>>
>>> *From:* Prithish [mailto:prith...@gmail.com]
>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>> *Cc:* User <user@spark.apache.org>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>>
>>>
>>> Anyone?
>>>
>>>
>>>
>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>
>>> I am using 2.0.1 and databricks avro library 3.0.1. I am running this on
>>> the latest AWS EMR release.
>>>
>>>
>>>
>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>> spark version? Are you using tungsten?
>>>
>>>
>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>> >
>>> > Can someone please explain why this happens?
>>> >
>>> > When I read a 600kb AVRO file and cache this in memory (using
>>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>>> this with different file sizes, and the size in-memory is always
>>> proportionate. I thought Spark compresses when using cacheTable.
>>>
>>>
>>>
>>>
>>>
>>
>>
>


Re: AVRO File size when caching in-memory

2016-11-15 Thread Prithish
Anyone?

On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:

> I am using 2.0.1 and databricks avro library 3.0.1. I am running this on
> the latest AWS EMR release.
>
> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> spark version? Are you using tungsten?
>>
>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>> >
>> > Can someone please explain why this happens?
>> >
>> > When I read a 600kb AVRO file and cache this in memory (using
>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>> this with different file sizes, and the size in-memory is always
>> proportionate. I thought Spark compresses when using cacheTable.
>>
>
>


Re: AVRO File size when caching in-memory

2016-11-14 Thread Prithish
I am using 2.0.1 and databricks avro library 3.0.1. I am running this on
the latest AWS EMR release.

On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> spark version? Are you using tungsten?
>
> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
> >
> > Can someone please explain why this happens?
> >
> > When I read a 600kb AVRO file and cache this in memory (using
> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
> this with different file sizes, and the size in-memory is always
> proportionate. I thought Spark compresses when using cacheTable.
>


AVRO File size when caching in-memory

2016-11-14 Thread Prithish
Can someone please explain why this happens?

When I read a 600kb AVRO file and cache this in memory (using cacheTable),
it shows up as 11mb (storage tab in Spark UI). I have tried this with
different file sizes, and the size in-memory is always proportionate. I
thought Spark compresses when using cacheTable.


Re: Reading AVRO from S3 - No parallelism

2016-10-27 Thread prithish
The Avro files were 500-600kb in size and that folder contained around 1200
files. The total folder size was around 600mb. Will try repartition. Thank you.

> On Oct 28, 2016 at 2:24 AM, <mich...@databricks.com> wrote:
>
> How big are your avro files? We collapse many small files into a single
> partition to eliminate scheduler overhead. If you need explicit
> parallelism you can also repartition.
>
> On Thu, Oct 27, 2016 at 5:19 AM, Prithish <prith...@gmail.com> wrote:
>
>> I am trying to read a bunch of AVRO files from a S3 folder using Spark 2.0.
>> No matter how many executors I use or what configuration changes I make,
>> the cluster doesn't seem to use all the executors. I am using the
>> com.databricks.spark.avro library from databricks to read the AVRO.
>>
>> However, if I try the same on CSV files (same S3 folder, same configuration
>> and cluster), it does use all executors.
>>
>> Is there something that I need to do to enable parallelism when using the
>> AVRO databricks library?
>>
>> Thanks for your help.
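
A hedged sketch of the repartition suggestion above (the bucket path and the
partition count are placeholders, not values from the thread):

  // Many small Avro files get collapsed into a few input partitions; an
  // explicit repartition spreads the work across all executors.
  val df = spark.read.format("com.databricks.spark.avro").load("s3://my-bucket/avro-folder/")
  val spread = df.repartition(120)   // e.g. a small multiple of the total executor cores
  spread.count()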

Reading AVRO from S3 - No parallelism

2016-10-27 Thread Prithish
I am trying to read a bunch of AVRO files from an S3 folder using Spark 2.0.
No matter how many executors I use or what configuration changes I make,
the cluster doesn't seem to use all the executors. I am using the
com.databricks.spark.avro library from Databricks to read the AVRO.

However, if I try the same on CSV files (same S3 folder, same configuration
and cluster), it does use all executors.

Is there something that I need to do to enable parallelism when using the
AVRO databricks library?

Thanks for your help.


Question about In-Memory size (cache / cacheTable)

2016-10-26 Thread Prithish
Hello,

I am trying to understand how the in-memory size changes in these
situations. Specifically, why is the in-memory size so much higher for Avro
and Parquet? Are there any optimizations that would reduce this?

Used cacheTable on each of these:

AVRO file (600kb) - in-memory size was 12mb
Parquet file (600kb) - in-memory size was 12mb
CSV file (3mb, same data as above) - in-memory size was 600kb

Because of this, we would need a cluster with much more memory if we were to
cache the Avro files.
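
Two Spark SQL settings that influence the in-memory columnar cache used by
cacheTable are worth checking (a sketch; the property names are standard Spark
2.x settings, and the values shown are just illustrative):

  spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")  // compress cached columns
  spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "10000")  // rows per cached column batch

As far as I can tell, the built-in compression schemes only apply to primitive
column types, which would be consistent with the nested-Avro observations made
elsewhere in these threads.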

Thanks for your help.

Prithish