Dataframe size using RDDStorageInfo objects

2018-03-17 Thread Bahubali Jain
Hi, I am trying to figure out a way to find the size of *persisted* dataframes using *sparkContext.getRDDStorageInfo()*. The RDDStorageInfo object has information related to the number of bytes stored in memory and on disk. For example: I have 3 dataframes which I have cached. df1.cache() df2.cache()
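A minimal sketch of this approach (assuming an existing SparkSession; the exact fields exposed by the storage-info objects can vary slightly across Spark versions, and matching a specific cached DataFrame to an entry usually relies on inspecting the RDD name or id):

    import org.apache.spark.sql.SparkSession

    // Sketch: summing cached bytes per RDD via SparkContext.getRDDStorageInfo
    // (a developer API). The DataFrame below is only a stand-in so the cache has
    // something to report.
    val spark = SparkSession.builder().appName("CachedSizes").getOrCreate()
    val sc = spark.sparkContext

    val df1 = spark.range(0, 1000000).toDF("id")
    df1.cache()
    df1.count()   // materialize the cache before asking for storage info

    sc.getRDDStorageInfo.foreach { info =>
      println(s"rdd=${info.name} id=${info.id} " +
        s"memSize=${info.memSize} diskSize=${info.diskSize} " +
        s"cachedPartitions=${info.numCachedPartitions}/${info.numPartitions}")
    }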

Compression during shuffle writes

2017-11-09 Thread Bahubali Jain
Hi, I have compressed data of size 500GB. I am repartitioning this data since the underlying data is very skewed and is causing a lot of issues for the downstream jobs. During repartitioning the *shuffle writes* are not getting compressed, and due to this I am running into disk space issues. Below is the
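For reference, a sketch of the settings that govern shuffle-output compression (these already default to true in recent Spark versions, so whether they explain the disk pressure here is an assumption; the codec choice is illustrative):

    import org.apache.spark.SparkConf

    // Sketch: explicitly enabling compression of shuffle output and of data
    // spilled during shuffles. Pass this conf when building the SparkContext,
    // or set the equivalent --conf flags on spark-submit.
    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")        // compress map-side shuffle output
      .set("spark.shuffle.spill.compress", "true")  // compress data spilled to disk during shuffles
      .set("spark.io.compression.codec", "lz4")     // codec used for shuffle/spill blocks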

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
case. > > > If you are looking for a workaround, the JIRA ticket clearly shows you how to > increase your driver heap. 1G in today's world really is kind of small. > > > Yong > > > -- > *From:* Bahubali Jain <bahub...@gmail.com> > *

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
owse/SPARK-12837> > issues.apache.org > Executing a SQL statement with a large number of partitions requires a > high memory space for the driver even when there are no requests to collect data > back to the driver. > > > > -- > *From:* Bahubali J

Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, While saving a dataset using *mydataset.write().csv("outputlocation")* I am running into an exception: *"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"*. Does it mean that for saving a dataset the whole of
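A sketch of the workaround discussed in the replies, raising the limit the error message names (the 4g value and paths are placeholders; note that spark.driver.memory only takes effect if set before the driver JVM starts, e.g. via spark-submit --driver-memory):

    import org.apache.spark.sql.SparkSession

    // Sketch: raise spark.driver.maxResultSize, which defaults to 1g ("0" means unlimited).
    val spark = SparkSession.builder()
      .appName("SaveDataset")
      .config("spark.driver.maxResultSize", "4g")
      .getOrCreate()

    // "/some/input" and "outputlocation" are placeholder paths.
    val mydataset = spark.read.parquet("/some/input")
    // Coalescing reduces the number of tasks, and with it the volume of
    // per-task results the driver has to track for this job.
    mydataset.coalesce(1000).write.csv("outputlocation")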

SPARK ML- Feature Selection Techniques

2016-09-05 Thread Bahubali Jain
Hi, Do we have any feature selection technique implementations (wrapper methods, embedded methods) available in Spark ML? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
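What ships in Spark ML is mostly filter-style selection rather than wrapper or embedded methods; a minimal sketch with ChiSqSelector, where the input DataFrame, column names, and the number of features to keep are all placeholders:

    import org.apache.spark.ml.feature.ChiSqSelector

    // Sketch: filter-style selection. Assumes a DataFrame with a categorical
    // "label" column and a "features" vector column.
    val trainingDF = spark.read.parquet("/some/labeled/data")   // placeholder input

    val selector = new ChiSqSelector()
      .setNumTopFeatures(50)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selectedFeatures")

    val selected = selector.fit(trainingDF).transform(trainingDF)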

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
e VectorIndexer which returns the model, then > add the model to the pipeline where it will only transform. > > val featureVectorIndexer = new VectorIndexer() > .setInputCol("feature") > .setOutputCol("indexedfeature") > .setMaxCategories(180) >

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi, I had run into a similar exception: "java.util.NoSuchElementException: key not found:". After further investigation I realized it is happening due to the VectorIndexer being executed on the training dataset and not on the entire dataset. In the dataframe I have 5 categories, each of these have to go
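A sketch of the fix described here: fit the indexer on the full dataframe so it sees every category value, then reuse the fitted model on the splits. The column names and maxCategories mirror the snippet quoted above; the input DataFrame is a placeholder:

    import org.apache.spark.ml.feature.VectorIndexer

    // Sketch: fit once on the entire dataset, transform each split with the
    // resulting model (or put the fitted model, not the estimator, in the pipeline).
    val indexerModel = new VectorIndexer()
      .setInputCol("feature")
      .setOutputCol("indexedfeature")
      .setMaxCategories(180)
      .fit(fullDF)                     // fullDF = the entire dataframe, not the training split

    val Array(trainDF, testDF) = fullDF.randomSplit(Array(0.7, 0.3))
    val indexedTrain = indexerModel.transform(trainDF)
    val indexedTest  = indexerModel.transform(testDF)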

Large files with wholetextfile()

2016-07-12 Thread Bahubali Jain
Hi, We have a requirement wherein we need to process a set of XML files; each of the XML files contains several records (eg: data of record 1.. data of record 2.. Expected output is Since we needed the file name as well in the output, we chose wholeTextFiles(). We had to go
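A minimal sketch of this approach (assuming an existing SparkContext sc; the path and the record-splitting logic are placeholders, and each whole file must fit in memory on a single executor, which is the usual concern with large files here):

    // Sketch: wholeTextFiles returns (filePath, fileContent) pairs, which keeps
    // the file name available for the output.
    val files = sc.wholeTextFiles("hdfs:///path/to/xml/dir")

    val records = files.flatMap { case (path, content) =>
      // placeholder: split the XML content into records however the format requires
      content.split("\n").map(record => (path, record))
    }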

DAG related query

2015-08-20 Thread Bahubali Jain
Hi, How would the DAG look for the below code? JavaRDD<String> rdd1 = context.textFile(SOMEPATH); JavaRDD<String> rdd2 = rdd1.map(DO something); rdd1 = rdd2.map(Do SOMETHING); Does this lead to any kind of cycle? Thanks, Baahu
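A Scala sketch of the same pattern (the transformations are placeholders): reassigning the variable only rebinds the name, each map() creates a new RDD pointing back to its parent, so the lineage stays a straight chain with no cycle:

    // textFile -> map -> map; toDebugString prints the linear lineage.
    var rdd1 = sc.textFile("SOMEPATH")
    val rdd2 = rdd1.map(line => line.toUpperCase)   // placeholder transformation
    rdd1 = rdd2.map(line => line.trim)              // rebinds the name; lineage is still linear

    println(rdd1.toDebugString)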

JavaRDD and saveAsNewAPIHadoopFile()

2015-06-27 Thread Bahubali Jain
Hi, Why doesn't JavaRDD have saveAsNewAPIHadoopFile() associated with it? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
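saveAsNewAPIHadoopFile lives on pair RDDs (JavaPairRDD / PairRDDFunctions), so a plain RDD is first mapped to key-value pairs. A Scala sketch; the key/value types and paths are one common choice, not the only one:

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val lines = sc.textFile("hdfs:///input/path")    // placeholder input

    lines
      .map(line => (NullWritable.get(), new Text(line)))   // make it a pair RDD
      .saveAsNewAPIHadoopFile(
        "hdfs:///output/path",
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]])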

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newAPIHadoopFile()? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
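A sketch under the assumption that the path string is handed straight to the Hadoop input format, which generally accepts a comma-separated list of directories (and globs); unioning one RDD per directory is a safe fallback if a particular input format does not:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Placeholder directories, comma-separated in a single path string.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/dir1,hdfs:///data/dir2",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])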

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement in which I plan to use Spark Streaming. I am supposed to calculate the access count of certain webpages. I receive the webpage access information through log files. By access count I mean how many times the page has been accessed *till now*. I have the log files for the past 2
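A sketch of a cumulative ("till now") per-page count with updateStateByKey, assuming an existing SparkContext sc; the log format, batch interval, and paths are placeholders, and the historical log files would still need to be folded in separately (for example by pre-computing their counts and adding them to the state):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    ssc.checkpoint("hdfs:///checkpoints/access-count")   // required for stateful operations

    val pages = ssc.textFileStream("hdfs:///logs/incoming")
      .map(line => (line.split(" ")(0), 1L))             // placeholder: first field = page URL

    val runningCounts = pages.updateStateByKey[Long] { (newHits: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newHits.sum)            // cumulative count per page
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()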

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
...@sigmoidanalytics.com wrote: Did you try: temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); Thanks Best Regards On Wed, Feb 11, 2015 at 9:40 AM, Bahubali Jain bahub...@gmail.com wrote: Hi, I am facing issues while writing data from

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream(some_hdfs_location) to pick up new files from an HDFS location. I am seeing pretty strange behavior though: textFileStream() is not detecting new files when I move them from a location within HDFS to the location at which textFileStream() is checking for new files.
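One likely explanation, worth verifying against the file-stream notes in the streaming guide: textFileStream selects files by modification time, and an HDFS rename keeps the file's original timestamp, so a file created before the stream started can be ignored even after it is moved into the watched directory. A minimal sketch, with the usual workaround noted (paths and interval are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Workaround sketch: give arriving files a fresh modification time, e.g. by
    // copying (not renaming) them into the watched directory once fully written.
    val ssc = new StreamingContext(sc, Seconds(30))
    val newFiles = ssc.textFileStream("hdfs:///landing/dir")   // placeholder directory
    newFiles.count().print()
    ssc.start()
    ssc.awaitTermination()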

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, You can associate all the messages of a 3-minute interval with a unique key, then group by that key and finally add up. Thanks On Dec 1, 2014 9:02 PM, pankaj pankaje...@gmail.com wrote: Hi, My incoming message has a timestamp as one field and I have to perform aggregation over 3 minutes of time
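A sketch of that suggestion: derive a 3-minute bucket from each message's own timestamp, use it as (part of) the key, and sum per key. The message shape and sample data are placeholders; in the streaming job the same map/reduce would run on each batch:

    // Placeholder message shape and sample data.
    case class Event(timestamp: Long, page: String, count: Long)

    val events = sc.parallelize(Seq(
      Event(1417426920000L, "/home", 1L),
      Event(1417427010000L, "/home", 1L)))

    val bucketMillis = 3 * 60 * 1000L
    val counts = events
      .map { e =>
        val bucket = (e.timestamp / bucketMillis) * bucketMillis   // truncate to the 3-minute interval
        ((bucket, e.page), e.count)
      }
      .reduceByKey(_ + _)                                          // add up within each interval

    counts.collect().foreach(println)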

Re: Help with Spark Streaming

2014-11-16 Thread Bahubali Jain
Hi, Can anybody please help me with this? I haven't been able to find the problem :( Thanks. On Nov 15, 2014 4:48 PM, Bahubali Jain bahub...@gmail.com wrote: Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window

Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window basis), so I am using updateStateByKey(), but for some reason this is not working. The function itself is not being invoked (I do not see the sysout output on
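A minimal consolidated word-count sketch, assuming an existing SparkContext sc and a placeholder socket source. Two things that commonly make the update function appear to never run are a missing checkpoint directory and a missing output operation on the resulting DStream; also, a println inside the update function executes on the executors, so its output lands in executor logs rather than the driver console:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/wordcount")            // required by updateStateByKey

    val words = ssc.socketTextStream("localhost", 9999)        // placeholder source
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    val totals = words.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batch.sum)                    // running total across all batches
    }

    totals.print()                                             // output operation so the job actually runs
    ssc.start()
    ssc.awaitTermination()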