Dataframe size using RDDStorageInfo objects

2018-03-17 Thread Bahubali Jain
Hi, I am trying to figure out a way to find the size of *persisted* dataframes using *sparkContext.getRDDStorageInfo()*. The RDDStorageInfo object has information related to the number of bytes stored in memory and on disk. For example: I have 3 dataframes which I have cached. df1.cache() df2.cache()
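A minimal sketch of this approach (assuming an existing SparkSession; the exact fields exposed by the storage-info objects can vary slightly across Spark versions, and matching a specific cached DataFrame to an entry usually relies on inspecting the RDD name or id):

    import org.apache.spark.sql.SparkSession

    // Sketch: summing cached bytes per RDD via SparkContext.getRDDStorageInfo
    // (a developer API). The DataFrame below is only a stand-in so the cache has
    // something to report.
    val spark = SparkSession.builder().appName("CachedSizes").getOrCreate()
    val sc = spark.sparkContext

    val df1 = spark.range(0, 1000000).toDF("id")
    df1.cache()
    df1.count()   // materialize the cache before asking for storage info

    sc.getRDDStorageInfo.foreach { info =>
      println(s"rdd=${info.name} id=${info.id} " +
        s"memSize=${info.memSize} diskSize=${info.diskSize} " +
        s"cachedPartitions=${info.numCachedPartitions}/${info.numPartitions}")
    }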

Compression during shuffle writes

2017-11-09 Thread Bahubali Jain
Hi, I have compressed data of size 500GB. I am repartitioning this data since the underlying data is very skewed and is causing a lot of issues for the downstream jobs. During repartitioning the *shuffle writes* are not getting compressed, and due to this I am running into disk space issues. Below is the
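For reference, a sketch of the settings that govern shuffle-output compression (these already default to true in recent Spark versions, so whether they explain the disk pressure here is an assumption; the codec choice is illustrative):

    import org.apache.spark.SparkConf

    // Sketch: explicitly enabling compression of shuffle output and of data
    // spilled during shuffles. Pass this conf when building the SparkContext,
    // or set the equivalent --conf flags on spark-submit.
    val conf = new SparkConf()
      .set("spark.shuffle.compress", "true")        // compress map-side shuffle output
      .set("spark.shuffle.spill.compress", "true")  // compress data spilled to disk during shuffles
      .set("spark.io.compression.codec", "lz4")     // codec used for shuffle/spill blocks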

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
case. > > > If you are looking for a workaround, the JIRA ticket clearly shows you how to > increase your driver heap. 1G in today's world really is kind of small. > > > Yong > > > -- > *From:* Bahubali Jain <bahub...@gmail.com> > *

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
owse/SPARK-12837> > issues.apache.org > Executing a SQL statement with a large number of partitions requires a > high memory space for the driver even when there are no requests to collect data > back to the driver. > > > > -- > *From:* Bahubali J

Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, While saving a dataset using *mydataset.write().csv("outputlocation")* I am running into an exception: *"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"*. Does it mean that for saving a dataset the whole of
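A sketch of the workaround discussed in the replies, raising the limit the error message names (the 4g value and paths are placeholders; note that spark.driver.memory only takes effect if set before the driver JVM starts, e.g. via spark-submit --driver-memory):

    import org.apache.spark.sql.SparkSession

    // Sketch: raise spark.driver.maxResultSize, which defaults to 1g ("0" means unlimited).
    val spark = SparkSession.builder()
      .appName("SaveDataset")
      .config("spark.driver.maxResultSize", "4g")
      .getOrCreate()

    // "/some/input" and "outputlocation" are placeholder paths.
    val mydataset = spark.read.parquet("/some/input")
    // Coalescing reduces the number of tasks, and with it the volume of
    // per-task results the driver has to track for this job.
    mydataset.coalesce(1000).write.csv("outputlocation")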

SPARK ML- Feature Selection Techniques

2016-09-05 Thread Bahubali Jain
Hi, Do we have any feature selection technique implementations (wrapper methods, embedded methods) available in Spark ML? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
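What ships in Spark ML is mostly filter-style selection rather than wrapper or embedded methods; a minimal sketch with ChiSqSelector, where the input DataFrame, column names, and the number of features to keep are all placeholders:

    import org.apache.spark.ml.feature.ChiSqSelector

    // Sketch: filter-style selection. Assumes a DataFrame with a categorical
    // "label" column and a "features" vector column.
    val trainingDF = spark.read.parquet("/some/labeled/data")   // placeholder input

    val selector = new ChiSqSelector()
      .setNumTopFeatures(50)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selectedFeatures")

    val selected = selector.fit(trainingDF).transform(trainingDF)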

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
e VectorIndexer which returns the model, then > add the model to the pipeline where it will only transform. > > val featureVectorIndexer = new VectorIndexer() > .setInputCol("feature") > .setOutputCol("indexedfeature") > .setMaxCategories(180) >

Re: Random Forest Classification

2016-08-30 Thread Bahubali Jain
Hi, I had run into a similar exception: "java.util.NoSuchElementException: key not found:". After further investigation I realized it is happening due to the VectorIndexer being executed on the training dataset and not on the entire dataset. In the dataframe I have 5 categories, each of these have to go
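A sketch of the fix described here: fit the indexer on the full dataframe so it sees every category value, then reuse the fitted model on the splits. The column names and maxCategories mirror the snippet quoted above; the input DataFrame is a placeholder:

    import org.apache.spark.ml.feature.VectorIndexer

    // Sketch: fit once on the entire dataset, transform each split with the
    // resulting model (or put the fitted model, not the estimator, in the pipeline).
    val indexerModel = new VectorIndexer()
      .setInputCol("feature")
      .setOutputCol("indexedfeature")
      .setMaxCategories(180)
      .fit(fullDF)                     // fullDF = the entire dataframe, not the training split

    val Array(trainDF, testDF) = fullDF.randomSplit(Array(0.7, 0.3))
    val indexedTrain = indexerModel.transform(trainDF)
    val indexedTest  = indexerModel.transform(testDF)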

Large files with wholetextfile()

2016-07-12 Thread Bahubali Jain
Hi, We have a requirement wherein we need to process a set of XML files; each of the XML files contains several records (eg: data of record 1.. data of record 2.. Expected output is Since we needed the file name as well in the output, we chose wholeTextFiles(). We had to go
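A minimal sketch of this approach (assuming an existing SparkContext sc; the path and the record-splitting logic are placeholders, and each whole file must fit in memory on a single executor, which is the usual concern with large files here):

    // Sketch: wholeTextFiles returns (filePath, fileContent) pairs, which keeps
    // the file name available for the output.
    val files = sc.wholeTextFiles("hdfs:///path/to/xml/dir")

    val records = files.flatMap { case (path, content) =>
      // placeholder: split the XML content into records however the format requires
      content.split("\n").map(record => (path, record))
    }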

DAG related query

2015-08-20 Thread Bahubali Jain
Hi, How would the DAG look for the below code? JavaRDD<String> rdd1 = context.textFile(SOMEPATH); JavaRDD<String> rdd2 = rdd1.map(DO something); rdd1 = rdd2.map(Do SOMETHING); Does this lead to any kind of cycle? Thanks, Baahu
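A Scala sketch of the same pattern (the transformations are placeholders): reassigning the variable only rebinds the name, each map() creates a new RDD pointing back to its parent, so the lineage stays a straight chain with no cycle:

    // textFile -> map -> map; toDebugString prints the linear lineage.
    var rdd1 = sc.textFile("SOMEPATH")
    val rdd2 = rdd1.map(line => line.toUpperCase)   // placeholder transformation
    rdd1 = rdd2.map(line => line.trim)              // rebinds the name; lineage is still linear

    println(rdd1.toDebugString)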

JavaRDD and saveAsNewAPIHadoopFile()

2015-06-27 Thread Bahubali Jain
Hi, Why doesn't JavaRDD have saveAsNewAPIHadoopFile() associated with it? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
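saveAsNewAPIHadoopFile lives on pair RDDs (JavaPairRDD / PairRDDFunctions), so a plain RDD is first mapped to key-value pairs. A Scala sketch; the key/value types and paths are one common choice, not the only one:

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    val lines = sc.textFile("hdfs:///input/path")    // placeholder input

    lines
      .map(line => (NullWritable.get(), new Text(line)))   // make it a pair RDD
      .saveAsNewAPIHadoopFile(
        "hdfs:///output/path",
        classOf[NullWritable],
        classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]])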

Multiple dir support : newApiHadoopFile

2015-06-26 Thread Bahubali Jain
Hi, How do we read files from multiple directories using newAPIHadoopFile()? Thanks, Baahu -- Twitter: http://twitter.com/Baahu
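A sketch under the assumption that the path string is handed straight to the Hadoop input format, which generally accepts a comma-separated list of directories (and globs); unioning one RDD per directory is a safe fallback if a particular input format does not:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Placeholder directories, comma-separated in a single path string.
    val rdd = sc.newAPIHadoopFile(
      "hdfs:///data/dir1,hdfs:///data/dir2",
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])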

Pseudo Spark Streaming ?

2015-04-05 Thread Bahubali Jain
Hi, I have a requirement in which I plan to use Spark Streaming. I am supposed to calculate the access count of certain webpages. I receive the webpage access information through log files. By access count I mean how many times the page has been accessed *till now*. I have the log files for the past 2
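A sketch of a cumulative ("till now") per-page count with updateStateByKey, assuming an existing SparkContext sc; the log format, batch interval, and paths are placeholders, and the historical log files would still need to be folded in separately (for example by pre-computing their counts and adding them to the state):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    ssc.checkpoint("hdfs:///checkpoints/access-count")   // required for stateful operations

    val pages = ssc.textFileStream("hdfs:///logs/incoming")
      .map(line => (line.split(" ")(0), 1L))             // placeholder: first field = page URL

    val runningCounts = pages.updateStateByKey[Long] { (newHits: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newHits.sum)            // cumulative count per page
    }

    runningCounts.print()
    ssc.start()
    ssc.awaitTermination()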

Re: Writing to HDFS from spark Streaming

2015-02-15 Thread Bahubali Jain
...@sigmoidanalytics.com wrote: Did you try: temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class, String.class, (Class) TextOutputFormat.class); Thanks Best Regards On Wed, Feb 11, 2015 at 9:40 AM, Bahubali Jain bahub...@gmail.com wrote: Hi, I am facing issues while writing data from

textFileStream() issue?

2014-12-03 Thread Bahubali Jain
Hi, I am trying to use textFileStream(some_hdfs_location) to pick up new files from an HDFS location. I am seeing pretty strange behavior though: textFileStream() is not detecting new files when I move them from a location within HDFS to the location at which textFileStream() is checking for new files.
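One likely explanation, worth verifying against the file-stream notes in the streaming guide: textFileStream selects files by modification time, and an HDFS rename keeps the file's original timestamp, so a file created before the stream started can be ignored even after it is moved into the watched directory. A minimal sketch, with the usual workaround noted (paths and interval are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Workaround sketch: give arriving files a fresh modification time, e.g. by
    // copying (not renaming) them into the watched directory once fully written.
    val ssc = new StreamingContext(sc, Seconds(30))
    val newFiles = ssc.textFileStream("hdfs:///landing/dir")   // placeholder directory
    newFiles.count().print()
    ssc.start()
    ssc.awaitTermination()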

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread Bahubali Jain
Hi, You can associate all the messages of a 3-minute interval with a unique key, then group by that key and finally add up. Thanks On Dec 1, 2014 9:02 PM, pankaj pankaje...@gmail.com wrote: Hi, My incoming message has a timestamp as one field and I have to perform aggregation over 3 minutes of time
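A sketch of that suggestion: derive a 3-minute bucket from each message's own timestamp, use it as (part of) the key, and sum per key. The message shape and sample data are placeholders; in the streaming job the same map/reduce would run on each batch:

    // Placeholder message shape and sample data.
    case class Event(timestamp: Long, page: String, count: Long)

    val events = sc.parallelize(Seq(
      Event(1417426920000L, "/home", 1L),
      Event(1417427010000L, "/home", 1L)))

    val bucketMillis = 3 * 60 * 1000L
    val counts = events
      .map { e =>
        val bucket = (e.timestamp / bucketMillis) * bucketMillis   // truncate to the 3-minute interval
        ((bucket, e.page), e.count)
      }
      .reduceByKey(_ + _)                                          // add up within each interval

    counts.collect().foreach(println)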

Re: Help with Spark Streaming

2014-11-16 Thread Bahubali Jain
Hi, Can anybody please help me with this? I haven't been able to find the problem :( Thanks. On Nov 15, 2014 4:48 PM, Bahubali Jain bahub...@gmail.com wrote: Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window

Help with Spark Streaming

2014-11-15 Thread Bahubali Jain
Hi, Trying to use Spark Streaming, but I am struggling with word count :( I want to consolidate the output of the word count (not on a per-window basis), so I am using updateStateByKey(), but for some reason this is not working. The function itself is not being invoked (I do not see the sysout output on
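A minimal consolidated word-count sketch, assuming an existing SparkContext sc and a placeholder socket source. Two things that commonly make the update function appear to never run are a missing checkpoint directory and a missing output operation on the resulting DStream; also, a println inside the update function executes on the executors, so its output lands in executor logs rather than the driver console:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("hdfs:///checkpoints/wordcount")            // required by updateStateByKey

    val words = ssc.socketTextStream("localhost", 9999)        // placeholder source
      .flatMap(_.split(" "))
      .map(word => (word, 1L))

    val totals = words.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + batch.sum)                    // running total across all batches
    }

    totals.print()                                             // output operation so the job actually runs
    ssc.start()
    ssc.awaitTermination()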