Hi,
In our use of groupByKey(...): RDD[(K, Iterable[V])], there
might be a case where, even for a single key (an extreme case, admittedly), the
associated Iterable[V] could result in an OOM.
Is it possible to provide the above 'groupByKeyWithRDD'?
And, ideally, it would be great if the …
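In the meantime, one common workaround (a sketch only, and only applicable
when the per-key logic is an associative merge; pairs and merge below are
illustrative placeholders) is to salt the hot keys so that no single key's
values have to be materialized at once:

// pairs: RDD[(K, V)], merge: (V, V) => V -- placeholders for your data
val buckets = 32
val salted = pairs.map { case (k, v) =>
  ((k, scala.util.Random.nextInt(buckets)), v)  // spread each key over 32 buckets
}
val partial = salted.reduceByKey(merge)          // pre-aggregate per bucket
val result = partial
  .map { case ((k, _), v) => (k, v) }
  .reduceByKey(merge)                            // final merge per original key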
Confirming that I am also hitting the same errors.
host: r3.8xlarge
configuration:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 200g
spark.serializer.objectStreamReset 10
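For reference, the serializer settings can also be applied programmatically
(a sketch; note that spark.driver.memory generally must be set before the
driver JVM starts, e.g. via spark-defaults.conf or spark-submit, so it is
omitted here):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-config")  // illustrative name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.serializer.objectStreamReset", "10")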
Hi Cody,
Thanks for the clarification. I will try to come up with a workaround.
I have another doubt: when my job is restarted and recovers from the
checkpoint, it does the re-partitioning step twice for each 15-minute job
until the 2-hour window is complete. Then the re-partitioning …
I have tried G1 GC. Could anyone please share their GC settings?
At the code level I am (a sketch of these steps follows the list):
1. reading an ORC table using a DataFrame
2. mapping the DataFrame to an RDD of my case class
3. converting that RDD to a paired RDD
4. applying combineByKey
5. saving the result to an ORC file
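For reference, a minimal sketch of the five steps above (Spark 1.x APIs; the
table path, case class, and aggregation logic are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Event(key: String, value: Double)

val sc = new SparkContext(new SparkConf().setAppName("combineByKey-pipeline"))
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._

// 1. Read the ORC table as a DataFrame.
val df = sqlContext.read.format("orc").load("/path/to/orc_table")

// 2. Map the DataFrame to an RDD of the case class.
val events = df.map(r => Event(r.getString(0), r.getDouble(1)))

// 3. Convert to a paired RDD.
val pairs = events.map(e => (e.key, e.value))

// 4. Aggregate with combineByKey (here: per-key sum and count).
val stats = pairs.combineByKey(
  (v: Double) => (v, 1L),
  (acc: (Double, Long), v: Double) => (acc._1 + v, acc._2 + 1L),
  (a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2))

// 5. Write the result back out as ORC (via a DataFrame).
stats.map { case (k, (sum, n)) => (k, sum / n) }
  .toDF("key", "mean")
  .write.format("orc").save("/path/to/output")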
Please suggest.
Regards,
Renu Yadav
Which release are you using?
If it's older than 1.5.0, you're missing some fixes, such as SPARK-9952.
Cheers
On Sat, Nov 14, 2015 at 6:35 PM, Jerry Lam wrote:
> Hi spark users and developers,
>
> Has anyone experienced slow startup of a job when it contains a stage
> with over 4 …
Hi Ted,
That sounds like exactly what is happening. It has been 5 hrs now. The code was
built for 1.4. Thank you very much!
Best Regards,
Jerry
Sent from my iPhone
> On 14 Nov, 2015, at 11:21 pm, Ted Yu wrote:
>
> Which release are you using?
> If older than 1.5.0, you miss …
Hi,
I am working on a requirement to calculate real-time metrics and am building a
prototype on Spark Streaming. I need to build aggregates at the second, minute,
hour, and day level.
I am not sure whether I should calculate all these aggregates as different
windowed functions on the input DStream or …
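For what it's worth, a minimal sketch of the windowed-function option
(host/port, metric names, and window sizes are illustrative; the day-level
aggregate is left out because very long DStream windows keep a lot of state,
which is one argument for rolling up the smaller aggregates in a store
instead):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("multi-window-metrics")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/path/to/checkpoint")  // required by the inverse-reduce windows

// Treat each input line as one occurrence of a metric name.
val events = ssc.socketTextStream("localhost", 9999).map(m => (m, 1L))

val perSecond = events.reduceByKey(_ + _)  // plain per-batch aggregate
val perMinute = events.reduceByKeyAndWindow(_ + _, _ - _, Minutes(1), Seconds(10))
val perHour = events.reduceByKeyAndWindow(_ + _, _ - _, Minutes(60), Minutes(1))

perSecond.print(); perMinute.print(); perHour.print()
ssc.start()
ssc.awaitTermination()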
I'm using pyspark 1.3.0, and struggling with what should be simple.
Basically, I'd like to run this:
site_logs.filter(lambda r: 'page_row' in r.request[:20])
meaning that I want to keep rows that have 'page_row' in the first 20
characters of the request column. The following is the closest …
Hi Walrus,
Try caching the results just before calling rdd.count.
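Something along these lines (assuming a SparkContext sc, e.g. from the
spark-shell; the input path and transform are illustrative):

val results = sc.textFile("/path/to/input").map(_.length).cache()
println(results.count())  // materializes the RDD once and caches it for reuse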
Regards,
Ajay
> On Nov 13, 2015, at 7:56 PM, Walrus theCat wrote:
>
> Hi,
>
> I have an RDD which crashes the driver when being collected. I want to send
> the data on its partitions out to S3
How are you writing it out? Can you post some code?
Regards
Sab
On 14-Nov-2015 5:21 am, "Rok Roskar" wrote:
> I'm not sure what you mean. I didn't do anything specifically to partition
> the columns
> On Nov 14, 2015 00:38, "Davies Liu" wrote:
You are probably trying to access the Spring context from the executors
after initializing it on the driver, and running into serialization issues.
You could instead use mapPartitions() and initialize the Spring context
from within that.
That said, I don't think that will solve all of your issues.
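A rough sketch of that pattern (AppConfig, MyService, and inputRdd are
placeholders for your application's classes):

import org.springframework.context.annotation.AnnotationConfigApplicationContext

val output = inputRdd.mapPartitions { records =>
  // Build the non-serializable Spring context once per partition, on the
  // executor, instead of shipping it from the driver.
  val ctx = new AnnotationConfigApplicationContext(classOf[AppConfig])
  val service = ctx.getBean(classOf[MyService])
  records.map(r => service.process(r))  // closing ctx is omitted for brevity
}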
You could try using the Akka actor system with Apache Spark if you intend
to use it in an online / interactive job-execution scenario.
On Sat, Nov 14, 2015, 08:19 Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> You are probably trying to access the spring context from the
Maybe you want to be using rdd.saveAsTextFile() ?
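e.g. (bucket and path are illustrative; each partition is written by its
executor as a separate part file, so nothing is collected back to the driver):

rdd.map(_.toString).saveAsTextFile("s3n://my-bucket/output/")  // or s3a://, per your Hadoop setup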
> On Nov 13, 2015, at 4:56 PM, Walrus theCat wrote:
>
> Hi,
>
> I have an RDD which crashes the driver when being collected. I want to send
> the data on its partitions out to S3 without bringing it back to the driver.
Hello,
We are running Spark on YARN, version 1.4.1.
java.vendor=Oracle Corporation
java.runtime.version=1.7.0_40-b43
datanucleus-core-3.2.10.jar
datanucleus-api-jdo-3.2.6.jar
datanucleus-rdbms-3.2.9.jar
[Spark UI task table header: Index, ID, Attempt, Status, Locality Level,
Executor ID / Host, Launch Time, Duration ▾, GC Time, Input Size / …]