Re: parquet vs orc files

2018-03-01 Thread Sushrut Ikhar
To add, schema evolution is better in Parquet than in ORC (at the cost of being a bit slower), as ORC is truly index-based; this is especially useful if you want to delete a column later. Regards, Sushrut Ikhar about.me/sushrutikhar <https://about.me/sushrutikhar?pr

Re: Using Thrift with Dataframe

2018-03-01 Thread Sushrut Ikhar
https://github.com/airbnb/airbnb-spark-thrift Regards, Sushrut Ikhar On Thu, Mar 1, 2018 at 6:05 AM, Nikhil Goyal <nownik...@gmail.com> wrote: > Hi guys, > > I have a RDD of thrift stru

Driver storage memory getting wasted

2016-10-17 Thread Sushrut Ikhar
Hi, is there any config to change the storage memory fraction for the driver? I'm not caching anything in the driver, but by default it picks up spark.memory.fraction (0.9) and spark.memory.storageFraction (0.6), whose values I've set to suit my executor usage. Regards, Sushrut Ikhar
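For context, the region sizes these two settings produce can be worked out by hand. The following is a minimal sketch, assuming Spark's unified memory model (1.6 and later), in which 300 MB of heap is reserved before spark.memory.fraction is applied. The 4096 MB heap is a hypothetical figure and does not come from the thread; the two fractions are the ones quoted in the message.

```python
# Assumption: unified memory manager (Spark 1.6+), which reserves 300 MB
# of heap before applying spark.memory.fraction.
RESERVED_MB = 300


def unified_memory_mb(heap_mb, memory_fraction, storage_fraction):
    """Return (usable, storage, execution) region sizes in MB."""
    usable = (heap_mb - RESERVED_MB) * memory_fraction
    storage = usable * storage_fraction
    execution = usable - storage
    return usable, storage, execution


# Hypothetical 4096 MB heap with the fractions quoted in the message.
usable, storage, execution = unified_memory_mb(4096, 0.9, 0.6)
print(f"usable={usable:.2f} MB storage={storage:.2f} MB execution={execution:.2f} MB")
```

Under this model the storage region is a soft boundary: execution can borrow from it when nothing is cached, so an unused storage share is not necessarily lost.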

Re: Spark Executor Lost issue

2016-09-28 Thread Sushrut Ikhar
Can you add more details: are you using RDDs, Datasets, or SQL? Are you doing group-bys or joins? Is your input splittable? By the way, you can pass the config the same way you are passing memoryOverhead, e.g. --conf spark.default.parallelism=1000, or set it through the SparkContext in code. Regards, Sushrut Ikhar
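The command-line route can be sketched as follows. This is only an illustration: the helper name, the application file my_job.py, and the overhead value are hypothetical, and the full key spark.yarn.executor.memoryOverhead is an assumption about which overhead setting was meant (it was the YARN key in Spark 1.x and 2.x). Only spark.default.parallelism=1000 comes from the reply.

```python
import shlex


def spark_submit_command(app, conf):
    """Build a spark-submit argument list with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


cmd = spark_submit_command(
    "my_job.py",  # hypothetical application file
    {
        "spark.yarn.executor.memoryOverhead": "1024",  # assumed key, illustrative value
        "spark.default.parallelism": "1000",  # value from the reply
    },
)
print(shlex.join(cmd))
```

Every setting travels as its own --conf key=value pair, so adding the parallelism setting does not disturb the overhead setting already being passed.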

Re: mapValues Transformation (JavaPairRDD)

2015-12-15 Thread Sushrut Ikhar
Well, the issue was that I was using some non-thread-safe functions to generate the key. Regards, Sushrut Ikhar On Tue, Dec 15, 2015 at 2:27 PM, Paweł Szulc <paul.sz...@gmail.com> wr

mapValues Transformation (JavaPairRDD)

2015-12-14 Thread Sushrut Ikhar
-1.4.1. Thanks in advance. Regards, Sushrut Ikhar

Re: merge 3 different types of RDDs in one

2015-12-01 Thread Sushrut Ikhar
Hi, I have used union myself in a similar case and applied reduceByKey on the result. Union + reduceByKey will suffice in place of a join, but you will first have to use map so that all values are of the same datatype. Regards, Sushrut Ikhar
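The map, union, reduceByKey pattern described here can be sketched in plain Python, without Spark, so that it runs anywhere. The three inputs and the reduce_by_key helper are hypothetical stand-ins; in Spark the corresponding RDD calls would be map, union, and reduceByKey.

```python
# Hypothetical inputs: three keyed collections with different value types.
names = [("u1", "Ann"), ("u2", "Bob")]
ages = [("u1", 31), ("u2", 45)]
scores = [("u1", 9.5)]

# Steps 1 and 2 (map + union): wrap every value in the same type, here a
# one-field dict, then concatenate the three collections into one.
tagged = (
    [(key, {"name": value}) for key, value in names]
    + [(key, {"age": value}) for key, value in ages]
    + [(key, {"score": value}) for key, value in scores]
)


def reduce_by_key(pairs, merge):
    """Merge all values that share a key, as reduceByKey does per key."""
    out = {}
    for key, value in pairs:
        out[key] = merge(out[key], value) if key in out else value
    return out


# Step 3 (reduceByKey): combine the per-key dicts into one record per key.
merged = reduce_by_key(tagged, lambda a, b: {**a, **b})
print(merged)
```

Unlike an inner join, a key that is missing from one input still appears in the result with the fields it does have, so the behaviour is closer to a full outer join.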

Re: Best practises

2015-11-02 Thread Sushrut Ikhar
This presentation may clarify many of your doubts. https://www.youtube.com/watch?v=7ooZ4S7Ay6Y Regards, Sushrut Ikhar On Mon, Nov 2, 2015 at 7:15 PM, Denny Lee <denny.g@gmail.com> wrote:

Split RDD into multiple RDDs using filter-transformation

2015-11-02 Thread Sushrut Ikhar
shows that no RDD partitions are actually being cached. How do I split it then without shuffling thrice? Regards, Sushrut Ikhar

Re: Running Spark in Yarn-client mode

2015-10-08 Thread Sushrut Ikhar
Hey Jean, thanks for the quick response. I am using Spark 1.4.1 pre-built with Hadoop 2.6. Yes, the YARN cluster has multiple running worker nodes. It would be a great help if you could tell me how to find the executor logs. Regards, Sushrut Ikhar

Running Spark in Yarn-client mode

2015-10-07 Thread Sushrut Ikhar
is now gated for [5000] ms. Reason is: [Disassociated]. I believe that the executors are starting but are unable to connect back to the driver. How do I resolve this? Also, I need help locating the driver and executor node logs. Thanks. Regards, Sushrut Ikhar