Hi,
I am trying to figure out a way to find the size of *persisted* dataframes
using *sparkContext.getRDDStorageInfo()*.
The RDDStorageInfo objects have information related to the number of bytes
stored in memory and on disk.
For example, I have 3 dataframes which I have cached:
df1.cache()
df2.cache()
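A sketch of how those storage numbers could be read back, assuming a JavaSparkContext `jsc` and that some action has already materialized the cache (cache() by itself is lazy); the RDD ids/names are whatever Spark assigns internally:

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.RDDInfo;

public class CachedSizes {
    // Print memory/disk bytes for every persisted RDD known to the context.
    public static void report(JavaSparkContext jsc) {
        for (RDDInfo info : jsc.sc().getRDDStorageInfo()) {
            System.out.println("RDD " + info.id() + " (" + info.name() + ")"
                    + " memory=" + info.memSize() + " bytes,"
                    + " disk=" + info.diskSize() + " bytes");
        }
    }
}
```

Note that a cached DataFrame surfaces as an internal RDD with a generated name, so mapping an entry back to df1/df2 may take some matching by id.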
Hi,
I have compressed data of size 500 GB. I am repartitioning this data since
the underlying data is very skewed and is causing a lot of issues for the
downstream jobs.
During repartitioning the *shuffle writes* are not getting compressed, and
due to this I am running into disk space issues. Below is the
case.
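Shuffle output is compressed by default, so it may be worth double-checking that these settings (shown here with illustrative values) have not been overridden somewhere:

```
spark.shuffle.compress         true   # compress map/shuffle output files (default: true)
spark.shuffle.spill.compress   true   # compress data spilled to disk during shuffles (default: true)
spark.io.compression.codec     lz4    # codec used for the above (e.g. snappy/lz4/lzf)
```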
>
>
> If you are looking for a workaround, the JIRA ticket clearly shows you how
> to increase your driver heap. 1G in today's world really is kind of small.
>
>
> Yong
>
>
> --
> *From:* Bahubali Jain <bahub...@gmail.com>
> https://issues.apache.org/jira/browse/SPARK-12837
> Executing a SQL statement with a large number of partitions requires a
> high memory space for the driver even when there are no requests to collect
> data back to the driver.
>
>
>
> --
> *From:* Bahubali J
Hi,
While saving a dataset using *mydataset.write().csv("outputlocation")* I am
running into an exception:
*"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than
spark.driver.maxResultSize (1024.0 MB)"*
Does it mean that for saving a dataset whole of
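For what it's worth, the usual workaround for that exact message is raising the driver-side result cap, e.g. in spark-defaults.conf or via --conf (the value here is just an example):

```
spark.driver.maxResultSize  2g   # default is 1g; 0 means unlimited
```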
Hi,
Do we have any feature selection techniques implemented (wrapper
methods, embedded methods) available in Spark ML?
Thanks,
Baahu
--
Twitter:http://twitter.com/Baahu
e VectorIndexer which returns the model, then
> add the model to the pipeline where it will only transform.
>
> val featureVectorIndexer = new VectorIndexer()
> .setInputCol("feature")
> .setOutputCol("indexedfeature")
> .setMaxCategories(180)
>
Hi,
I had run into a similar exception "java.util.NoSuchElementException: key
not found: ".
After further investigation I realized it is happening due to the
VectorIndexer being fit on the training dataset and not on the entire dataset.
In the dataframe I have 5 categories, each of these has to go
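A sketch of the fix described above, using Spark ML's Java API with hypothetical `fullData`/`trainData`/`testData` datasets; the column names mirror the earlier snippet:

```java
import org.apache.spark.ml.feature.VectorIndexer;
import org.apache.spark.ml.feature.VectorIndexerModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Fit on the full dataset so every category value is seen once ...
VectorIndexerModel indexerModel = new VectorIndexer()
        .setInputCol("feature")
        .setOutputCol("indexedfeature")
        .setMaxCategories(180)
        .fit(fullData);

// ... then only transform the splits with the already-fitted model.
Dataset<Row> indexedTrain = indexerModel.transform(trainData);
Dataset<Row> indexedTest  = indexerModel.transform(testData);
```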
Hi,
We have a requirement wherein we need to process a set of XML files; each
of the XML files contains several records (e.g.:
data of record 1..
data of record 2..
Expected output is
Since we needed the file name as well in the output, we chose
wholeTextFiles(). We had to go
Hi,
How would the DAG look for the below code?
JavaRDD<String> rdd1 = context.textFile(SOMEPATH);
JavaRDD<String> rdd2 = rdd1.map(DO something);
rdd1 = rdd2.map(Do SOMETHING);
Does this lead to any kind of cycle?
Thanks,
Baahu
Hi,
Why doesn't JavaRDD have saveAsNewAPIHadoopFile() associated with it?
Thanks,
Baahu
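If it helps, saveAsNewAPIHadoopFile() is exposed on JavaPairRDD rather than JavaRDD (the new Hadoop OutputFormats are key/value oriented), so the usual route is a mapToPair() first; a sketch with hypothetical names and paths:

```java
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Assumes an existing JavaRDD<String> `lines`; wrap each line as a
// (NullWritable, Text) pair so a Hadoop OutputFormat can write it.
JavaPairRDD<NullWritable, Text> pairs = lines.mapToPair(
        line -> new Tuple2<>(NullWritable.get(), new Text(line)));

pairs.saveAsNewAPIHadoopFile("hdfs:///tmp/out",   // hypothetical path
        NullWritable.class, Text.class, TextOutputFormat.class);
```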
Hi,
How do we read files from multiple directories using newAPIHadoopFile()?
Thanks,
Baahu
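As far as I know, the path argument accepts a comma-separated list (and globs), so joining the directories into one string should work; a small sketch (directory names are made up, and the actual Spark call is left as a comment since it needs a live context):

```java
public class MultiDirPaths {
    // Join several input directories into the comma-separated form that
    // Hadoop's FileInputFormat (and hence newAPIHadoopFile) accepts.
    public static String joinPaths(String... dirs) {
        return String.join(",", dirs);
    }

    public static void main(String[] args) {
        String paths = joinPaths("hdfs:///data/dir1", "hdfs:///data/dir2");
        System.out.println(paths);
        // With a JavaSparkContext `jsc` and Hadoop Configuration `conf`:
        // jsc.newAPIHadoopFile(paths, TextInputFormat.class,
        //         LongWritable.class, Text.class, conf);
    }
}
```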
Hi,
I have a requirement in which I plan to use Spark Streaming.
I am supposed to calculate the access count for certain webpages. I receive
the webpage access information through log files.
By access count I mean how many times the page was accessed *till now*.
I have the log files for the past 2
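A running count across batches is what updateStateByKey() is meant for; a minimal sketch against the Spark 2.x Java API, assuming a JavaStreamingContext `ssc` and a JavaPairDStream<String, Integer> `pageHits` of (page, 1) pairs:

```java
import java.util.List;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Checkpointing is mandatory for stateful operations.
ssc.checkpoint("hdfs:///tmp/checkpoints");   // hypothetical path

JavaPairDStream<String, Integer> totals = pageHits.updateStateByKey(
        (List<Integer> newHits, Optional<Integer> state) -> {
            int sum = state.orElse(0);          // count so far
            for (int n : newHits) sum += n;     // plus this batch
            return Optional.of(sum);
        });
totals.print();
```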
...@sigmoidanalytics.com
wrote:
Did you try:
temp.saveAsHadoopFiles("DailyCSV", ".txt", String.class,
String.class, (Class)
TextOutputFormat.class);
Thanks
Best Regards
On Wed, Feb 11, 2015 at 9:40 AM, Bahubali Jain bahub...@gmail.com
wrote:
Hi,
I am facing issues while writing data from
Hi,
I am trying to use textFileStream(some_hdfs_location) to pick up new files
from an HDFS location. I am seeing a pretty strange behavior though.
textFileStream() is not detecting new files when I move them from a
location within HDFS to the location at which textFileStream() is checking
for new files.
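One possible explanation, offered as a guess: the file stream picks up files by modification time, and an HDFS rename/move keeps the file's original timestamp, so a moved file can look too old to count as "new". Copying instead of moving gives the file a fresh timestamp inside the monitored directory (paths here are hypothetical):

```shell
# copy (not move) so the file's modification time is current
hdfs dfs -cp /staging/input-0001.log /monitored/dir/input-0001.log
```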
Hi,
You can associate all the messages of a 3-minute interval with a unique key
and then group by and finally add up.
Thanks
On Dec 1, 2014 9:02 PM, pankaj pankaje...@gmail.com wrote:
Hi,
My incoming message has a time stamp as one field and I have to perform
aggregation over 3 minutes of time
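The keying suggested above can be sketched as a pure function: floor each message's epoch-millis timestamp to the start of its 3-minute bucket, then aggregate per key (e.g. with reduceByKey):

```java
public class IntervalKey {
    static final long INTERVAL_MS = 3 * 60 * 1000;   // 3 minutes

    // Floor an epoch-millis timestamp to the start of its 3-minute bucket;
    // all messages in the same interval share this key.
    public static long bucketOf(long timestampMs) {
        return timestampMs - (timestampMs % INTERVAL_MS);
    }

    public static void main(String[] args) {
        System.out.println(bucketOf(1_000_000L));   // 16m40s
        System.out.println(bucketOf(1_050_000L));   // 17m30s, same bucket
    }
}
```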
Hi,
Can anybody help me with this please? I haven't been able to find the
problem :(
Thanks.
On Nov 15, 2014 4:48 PM, Bahubali Jain bahub...@gmail.com wrote:
Hi,
Trying to use Spark Streaming, but I am struggling with word count :(
I want to consolidate the output of the word count (not on a per-window
Hi,
Trying to use Spark Streaming, but I am struggling with word count :(
I want to consolidate the output of the word count (not on a per-window
basis), so I am using updateStateByKey(), but for some reason this is not
working.
The function itself is not being invoked (I do not see the sysout output on
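Two guesses worth checking when the update function never fires, sketched against the Java streaming API (assuming a JavaStreamingContext `ssc` and the updateStateByKey() result `counts`):

```java
// 1) updateStateByKey() requires checkpointing; without it the stateful
//    stream fails to start properly.
ssc.checkpoint("hdfs:///tmp/checkpoints");   // hypothetical path

// 2) a DStream with no output action is never computed, so the update
//    function is never invoked.
counts.print();
```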