Re: RDD blocks on Spark Driver
This is the command I am running:

spark-submit --deploy-mode cluster --master yarn --class com.myorg.myApp s3://my-bucket/myapp-0.1.jar

On Wed, Mar 1, 2017 at 12:22 AM, Jonathan Kelly <jonathaka...@gmail.com> wrote:

> Prithish,
>
> It would be helpful for you to share the spark-submit command you are
> running.
>
> ~ Jonathan
>
> On Sun, Feb 26, 2017 at 8:29 AM Prithish <prith...@gmail.com> wrote:
>
>> Thanks for the responses. I am running this on Amazon EMR, which uses
>> the YARN cluster manager.
>>
>> On Sat, Feb 25, 2017 at 4:45 PM, liangyhg...@gmail.com
>> <liangyhg...@gmail.com> wrote:
>>
>> Hi,
>> I think you are using the local mode of Spark. There are mainly four
>> modes: local, standalone, YARN and Mesos. Also, "blocks" relate to HDFS,
>> while "partitions" relate to Spark.
>>
>> liangyihuai
>>
>> ---Original---
>> *From:* "Jacek Laskowski" <ja...@japila.pl>
>> *Date:* 2017/2/25 02:45:20
>> *To:* "prithish" <prith...@gmail.com>;
>> *Cc:* "user" <user@spark.apache.org>;
>> *Subject:* Re: RDD blocks on Spark Driver
>>
>> Hi,
>>
>> I guess you're using local mode, which has only one executor, called the
>> driver. Is my guess correct?
>>
>> Jacek
>>
>> On 23 Feb 2017 2:03 a.m., <prith...@gmail.com> wrote:
>>
>> Hello,
>>
>> I had a question. When I look at the Executors tab in the Spark UI, I
>> notice that some RDD blocks are assigned to the driver as well. Can
>> someone please tell me why?
>>
>> Thanks for the help.
Re: Custom log4j.properties on AWS EMR
Thanks for your response, Jonathan. Yes, this works. I also added another
way of achieving this to the Stack Overflow post. Thanks for the help.

On Tue, Feb 28, 2017 at 11:58 PM, Jonathan Kelly <jonathaka...@gmail.com> wrote:

> Prithish,
>
> I saw you posted this on SO, so I responded there just now. See
> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161
>
> In short, an hdfs:// path can't be used to configure log4j because log4j
> knows nothing about HDFS. Instead, since you are using EMR, you should use
> the Configuration API when creating your cluster to configure the
> spark-log4j configuration classification. See
> http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
> for more info.
>
> ~ Jonathan
>
> On Sun, Feb 26, 2017 at 8:17 PM Prithish <prith...@gmail.com> wrote:
>
>> Steve, I tried that, but it didn't work. Any other ideas?
>>
>> On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com> wrote:
>>
>> Try giving a resource of a file in the JAR, e.g. add a file
>> "log4j-debugging.properties" into the jar, and give a config option of
>> -Dlog4j.configuration=/log4j-debugging.properties (maybe also try
>> without the "/").
>>
>> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>>
>> Hoping someone can answer this.
>>
>> I am unable to override and use a custom log4j.properties on Amazon EMR.
>> I am running Spark on EMR (YARN) and have tried all the combinations
>> below in spark-submit to try and use the custom log4j.
>>
>> In client mode:
>> --driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>>
>> In cluster mode:
>> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>>
>> I have also tried picking from the local filesystem using file: instead
>> of hdfs. None of these seems to work. However, I can get this working
>> when running on my local YARN setup.
>>
>> Any ideas?
>>
>> I have also posted on Stack Overflow (link below):
>> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
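For completeness, a hedged sketch of what passing that spark-log4j
classification at cluster-creation time could look like via the AWS CLI.
The release label and the log4j property shown are illustrative
assumptions, not taken from the thread:

    # Assumed release label and log4j property; --configurations is the part that matters.
    aws emr create-cluster \
      --release-label emr-5.3.0 \
      --applications Name=Spark \
      --configurations '[{"Classification":"spark-log4j","Properties":{"log4j.rootCategory":"WARN, console"}}]' \
      ... # remaining cluster options elided

EMR renders these properties into the log4j configuration that Spark is
started with on the cluster nodes, which sidesteps the hdfs:// limitation
entirely.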
Re: Custom log4j.properties on AWS EMR
Steve, I tried that, but it didn't work. Any other ideas?

On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran <ste...@hortonworks.com> wrote:

> Try giving a resource of a file in the JAR, e.g. add a file
> "log4j-debugging.properties" into the jar, and give a config option of
> -Dlog4j.configuration=/log4j-debugging.properties (maybe also try
> without the "/").
>
> On 26 Feb 2017, at 16:31, Prithish <prith...@gmail.com> wrote:
>
> Hoping someone can answer this.
>
> I am unable to override and use a custom log4j.properties on Amazon EMR.
> I am running Spark on EMR (YARN) and have tried all the combinations
> below in spark-submit to try and use the custom log4j.
>
> In client mode:
> --driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>
> In cluster mode:
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"
>
> I have also tried picking from the local filesystem using file: instead
> of hdfs. None of these seems to work. However, I can get this working
> when running on my local YARN setup.
>
> Any ideas?
>
> I have also posted on Stack Overflow (link below):
> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
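A minimal sketch of Steve's suggestion, reusing the submit command shown in
the other thread. The resource name is his example; bundling it at the root
of the application JAR and setting the executor-side option as well are
assumptions on my part:

    # log4j-debugging.properties is packaged at the root of myapp-0.1.jar.
    # With no URL scheme, log4j 1.x falls back to resolving the value as a
    # classpath resource.
    spark-submit --deploy-mode cluster --master yarn \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-debugging.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-debugging.properties" \
      --class com.myorg.myApp s3://my-bucket/myapp-0.1.jar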
Custom log4j.properties on AWS EMR
Hoping someone can answer this.

I am unable to override and use a custom log4j.properties on Amazon EMR. I
am running Spark on EMR (YARN) and have tried all the combinations below in
spark-submit to try and use the custom log4j.

In client mode:
--driver-java-options "-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

In cluster mode:
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=hdfs://host:port/user/hadoop/log4j.properties"

I have also tried picking from the local filesystem using file: instead of
hdfs. None of these seems to work. However, I can get this working when
running on my local YARN setup.

Any ideas?

I have also posted on Stack Overflow (link below):
http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
Re: RDD blocks on Spark Driver
Thanks for the responses. I am running this on Amazon EMR, which uses the
YARN cluster manager.

On Sat, Feb 25, 2017 at 4:45 PM, liangyhg...@gmail.com
<liangyhg...@gmail.com> wrote:

> Hi,
> I think you are using the local mode of Spark. There are mainly four
> modes: local, standalone, YARN and Mesos. Also, "blocks" relate to HDFS,
> while "partitions" relate to Spark.
>
> liangyihuai
>
> ---Original---
> *From:* "Jacek Laskowski" <ja...@japila.pl>
> *Date:* 2017/2/25 02:45:20
> *To:* "prithish" <prith...@gmail.com>;
> *Cc:* "user" <user@spark.apache.org>;
> *Subject:* Re: RDD blocks on Spark Driver
>
> Hi,
>
> I guess you're using local mode, which has only one executor, called the
> driver. Is my guess correct?
>
> Jacek
>
> On 23 Feb 2017 2:03 a.m., <prith...@gmail.com> wrote:
>
>> Hello,
>>
>> I had a question. When I look at the Executors tab in the Spark UI, I
>> notice that some RDD blocks are assigned to the driver as well. Can
>> someone please tell me why?
>>
>> Thanks for the help.
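As a reference for the four modes mentioned above, the cluster manager is
selected through the --master URL at submit time. A sketch of the common
forms (host names are placeholders; 7077 and 5050 are the usual default
ports):

    spark-submit --master local[*] ...            # local mode: driver and executors share one JVM
    spark-submit --master spark://host:7077 ...   # standalone cluster manager
    spark-submit --master yarn ...                # YARN, which EMR uses
    spark-submit --master mesos://host:5050 ...   # Mesos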
RDD blocks on Spark Driver
Hello,

I had a question. When I look at the Executors tab in the Spark UI, I
notice that some RDD blocks are assigned to the driver as well. Can someone
please tell me why?

Thanks for the help.
Re: AVRO File size when caching in-memory
It's something like the schema shown below (with several additional
levels/sublevels):

root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)

On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro <linguin@gmail.com> wrote:

> Hi,
>
> What's the schema interpreted by Spark?
> The compression logic of Spark caching depends on column types.
>
> // maropu
>
> On Wed, Nov 16, 2016 at 5:26 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I did some more tests and I am seeing that when I have a flatter
>> structure for my AVRO, the cache memory use is close to the CSV. But when
>> I use a few levels of nesting, the cache memory usage blows up. This is
>> really critical for planning the cluster we will be using. To avoid using
>> a larger cluster, it looks like we will have to consider keeping the
>> structure as flat as possible.
>>
>> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>
>>> (Adding user@spark back to the discussion)
>>>
>>> Well, the CSV vs AVRO difference might be simpler to explain. CSV has a
>>> lot of scope for compression. On the other hand, Avro and Parquet are
>>> already compressed and just store extra schema info, afaik. Avro and
>>> Parquet are both going to make your data smaller: Parquet through
>>> compressed columnar storage, and Avro through its binary data format.
>>>
>>> Next, talking about the 62KB becoming 1224KB: I actually do not see such
>>> a massive blow-up. The Avro file you shared is 28KB on my system and
>>> becomes 53.7KB when cached in memory deserialized and 52.9KB when cached
>>> in memory serialized. Exact same numbers with Parquet as well. This is
>>> expected behavior, if I am not wrong.
>>>
>>> In fact, now that I think about it, even larger blow-ups might be valid,
>>> since your data must have been deserialized from the compressed Avro
>>> format, making it bigger. The order of magnitude of the difference in
>>> size would depend on the type of data you have and how compressible it
>>> was.
>>>
>>> The purpose of these formats is to store data to persistent storage in a
>>> way that's faster to read from, not to reduce cache-memory usage.
>>>
>>> Maybe others here have more info to share.
>>>
>>> Regards,
>>> Shreya
>>>
>>> Sent from my Windows 10 phone
>>>
>>> *From:* Prithish <prith...@gmail.com>
>>> *Sent:* Tuesday, November 15, 2016 11:04 PM
>>> *To:* Shreya Agarwal <shrey...@microsoft.com>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>> I did another test and am noting my observations here. These were done
>>> with the same data in Avro and CSV formats.
>>>
>>> In AVRO, the file size on disk was 62KB and after caching, the in-memory
>>> size is 1224KB.
>>> In CSV, the file size on disk was 690KB and after caching, the in-memory
>>> size is 290KB.
>>>
>>> I'm guessing that the Spark caching is not able to compress when the
>>> source is Avro. Not sure if this is just my immature conclusion. Waiting
>>> to hear your observations.
>>>
>>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>>>
>>>> Thanks for your response.
>>>>
>>>> I have attached the code (that I ran using the spark-shell) as well as
>>>> a sample avro file.
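The attachment itself isn't in the archive, but a minimal spark-shell
sketch of the test described in this thread might look like the following.
The S3 path and table name are placeholders; the spark-avro artifact
version is taken from later in the thread:

    // Launched with: spark-shell --packages com.databricks:spark-avro_2.11:3.0.1
    val df = spark.read.format("com.databricks.spark.avro")
      .load("s3://my-bucket/sample.avro")   // assumed path

    df.createOrReplaceTempView("events")
    spark.catalog.cacheTable("events")
    spark.table("events").count()           // caching is lazy; force materialization

    // The in-memory size then appears under the Storage tab of the
    // Spark UI (localhost:4040), as described in the thread.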
Re: AVRO File size when caching in-memory
Thanks for your response.

I did some more tests and I am seeing that when I have a flatter structure
for my AVRO, the cache memory use is close to the CSV. But when I use a few
levels of nesting, the cache memory usage blows up. This is really critical
for planning the cluster we will be using. To avoid using a larger cluster,
it looks like we will have to consider keeping the structure as flat as
possible (see the flattening sketch after this message).

On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com> wrote:

> (Adding user@spark back to the discussion)
>
> Well, the CSV vs AVRO difference might be simpler to explain. CSV has a
> lot of scope for compression. On the other hand, Avro and Parquet are
> already compressed and just store extra schema info, afaik. Avro and
> Parquet are both going to make your data smaller: Parquet through
> compressed columnar storage, and Avro through its binary data format.
>
> Next, talking about the 62KB becoming 1224KB: I actually do not see such a
> massive blow-up. The Avro file you shared is 28KB on my system and becomes
> 53.7KB when cached in memory deserialized and 52.9KB when cached in memory
> serialized. Exact same numbers with Parquet as well. This is expected
> behavior, if I am not wrong.
>
> In fact, now that I think about it, even larger blow-ups might be valid,
> since your data must have been deserialized from the compressed Avro
> format, making it bigger. The order of magnitude of the difference in size
> would depend on the type of data you have and how compressible it was.
>
> The purpose of these formats is to store data to persistent storage in a
> way that's faster to read from, not to reduce cache-memory usage.
>
> Maybe others here have more info to share.
>
> Regards,
> Shreya
>
> Sent from my Windows 10 phone
>
> *From:* Prithish <prith...@gmail.com>
> *Sent:* Tuesday, November 15, 2016 11:04 PM
> *To:* Shreya Agarwal <shrey...@microsoft.com>
> *Subject:* Re: AVRO File size when caching in-memory
>
> I did another test and am noting my observations here. These were done
> with the same data in Avro and CSV formats.
>
> In AVRO, the file size on disk was 62KB and after caching, the in-memory
> size is 1224KB.
> In CSV, the file size on disk was 690KB and after caching, the in-memory
> size is 290KB.
>
> I'm guessing that the Spark caching is not able to compress when the
> source is Avro. Not sure if this is just my immature conclusion. Waiting
> to hear your observations.
>
> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I have attached the code (that I ran using the spark-shell) as well as a
>> sample avro file. After you run this code, the data is cached in memory
>> and you can go to the "Storage" tab in the Spark UI (localhost:4040) and
>> see the size it uses. In this example the size is small, but in my actual
>> scenario, the source file size is 30GB and the in-memory size comes to
>> around 800GB. I am trying to understand whether this is expected when
>> using Avro or not.
>>
>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>
>>> I haven't used Avro ever. But if you can send over a quick sample code,
>>> I can run it and see if I can repro the issue and maybe debug it.
>>>
>>> *From:* Prithish [mailto:prith...@gmail.com]
>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>> *Cc:* User <user@spark.apache.org>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>> Anyone?
>>>
>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>
>>> I am using 2.0.1 and the Databricks avro library 3.0.1. I am running
>>> this on the latest AWS EMR release.
>>>
>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>
>>> Spark version? Are you using Tungsten?
>>>
>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>> >
>>> > Can someone please explain why this happens?
>>> >
>>> > When I read a 600KB AVRO file and cache it in memory (using
>>> > cacheTable), it shows up as 11MB (Storage tab in the Spark UI). I have
>>> > tried this with different file sizes, and the in-memory size is always
>>> > proportionately larger. I thought Spark compresses when using
>>> > cacheTable.
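Since the takeaway above is to keep the structure flat, here is a hedged
sketch of one way to flatten nested columns before caching. The column
names come from the schema posted earlier in the thread; the aliases and
the choice of fields are my own illustration:

    import spark.implicits._

    // Pull nested fields up to the top level so the cached columns are flat.
    val flat = df.select(
      $"sentAt", $"sharing", $"receivedAt", $"ip",
      $"story.lang".as("story_lang"),
      $"story.myapp.ver".as("myapp_ver"),
      $"story.loc.city".as("loc_city"),
      $"story.loc.country".as("loc_country"))

    flat.createOrReplaceTempView("events_flat")
    spark.catalog.cacheTable("events_flat")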
Re: AVRO File size when caching in-memory
Anyone?

On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:

> I am using 2.0.1 and the Databricks avro library 3.0.1. I am running this
> on the latest AWS EMR release.
>
> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>
>> Spark version? Are you using Tungsten?
>>
>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>> >
>> > Can someone please explain why this happens?
>> >
>> > When I read a 600KB AVRO file and cache it in memory (using
>> > cacheTable), it shows up as 11MB (Storage tab in the Spark UI). I have
>> > tried this with different file sizes, and the in-memory size is always
>> > proportionately larger. I thought Spark compresses when using
>> > cacheTable.
Re: AVRO File size when caching in-memory
I am using 2.0.1 and the Databricks avro library 3.0.1. I am running this
on the latest AWS EMR release.

On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Spark version? Are you using Tungsten?
>
> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
> >
> > Can someone please explain why this happens?
> >
> > When I read a 600KB AVRO file and cache it in memory (using cacheTable),
> > it shows up as 11MB (Storage tab in the Spark UI). I have tried this
> > with different file sizes, and the in-memory size is always
> > proportionately larger. I thought Spark compresses when using
> > cacheTable.
AVRO File size when caching in-memory
Can someone please explain why this happens?

When I read a 600KB AVRO file and cache it in memory (using cacheTable), it
shows up as 11MB (Storage tab in the Spark UI). I have tried this with
different file sizes, and the in-memory size is always proportionately
larger. I thought Spark compresses when using cacheTable.
Re: Reading AVRO from S3 - No parallelism
The Avro files were 500-600KB in size and that folder contained around 1200
files. The total folder size was around 600MB. Will try repartition. Thank
you.

On Oct 28, 2016 at 2:24 AM, <mich...@databricks.com> wrote:

> How big are your avro files? We collapse many small files into a single
> partition to eliminate scheduler overhead. If you need explicit
> parallelism you can also repartition.
>
> On Thu, Oct 27, 2016 at 5:19 AM, Prithish <prith...@gmail.com> wrote:
>
>> I am trying to read a bunch of AVRO files from an S3 folder using Spark
>> 2.0. No matter how many executors I use or what configuration changes I
>> make, the cluster doesn't seem to use all the executors. I am using the
>> com.databricks.spark.avro library from Databricks to read the AVRO.
>>
>> However, if I try the same on CSV files (same S3 folder, same
>> configuration and cluster), it does use all executors.
>>
>> Is there something that I need to do to enable parallelism when using
>> the AVRO databricks library?
>>
>> Thanks for your help.
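A minimal sketch of the repartition suggestion above; the S3 path and the
partition count are placeholders to be tuned for the cluster:

    val df = spark.read.format("com.databricks.spark.avro")
      .load("s3://my-bucket/avro-folder/")   // assumed folder of many small Avro files
      .repartition(200)                      // spread the rows across more tasks/executors

    df.count()   // downstream actions now run with explicit parallelism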
Reading AVRO from S3 - No parallelism
I am trying to read a bunch of AVRO files from an S3 folder using Spark
2.0. No matter how many executors I use or what configuration changes I
make, the cluster doesn't seem to use all the executors. I am using the
com.databricks.spark.avro library from Databricks to read the AVRO.

However, if I try the same on CSV files (same S3 folder, same configuration
and cluster), it does use all executors.

Is there something that I need to do to enable parallelism when using the
AVRO databricks library?

Thanks for your help.
Question about In-Memory size (cache / cacheTable)
Hello,

I am trying to understand how the in-memory size changes in these
situations. Specifically, why is the in-memory size much higher for Avro
and Parquet? Are there any optimizations necessary to reduce this?

I used cacheTable on each of these:

AVRO file (600KB): in-memory size was 12MB
Parquet file (600KB): in-memory size was 12MB
CSV file (3MB, the same data as above): in-memory size was 600KB

Because of this, we would need a cluster with much more memory if we were
to cache the Avro files.

Thanks for your help.
Prithish
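For anyone wanting to read these numbers programmatically rather than from
the Storage tab, a small sketch using the RDD storage info (a developer
API); the table name is a placeholder:

    spark.catalog.cacheTable("events")
    spark.table("events").count()   // caching is lazy; force materialization

    // Each cached table is backed by an RDD; memSize is its in-memory footprint.
    spark.sparkContext.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.memSize} bytes in memory")
    }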