Thanks TD, but the SQL plan does not seem to provide any information on
which stage is taking longer, or to identify bottlenecks across the
various stages. Spark Kafka Direct used to provide information about the
various stages in a micro batch and the time taken by each. Is there
a way to
-> "test1"
)
val hubbleStream = KafkaUtils.createDirectStream[String, String](
ssc,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams)
)
val kafkaStreamRdd = kafkaStream.transform { rdd =>
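For per-batch timing there is a programmatic option — a minimal sketch, assuming
the ssc from the snippet above; per-stage timing would need a SparkListener
instead:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs scheduling/processing/total delay for every completed micro batch.
class BatchTimingListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batch=${info.batchTime} records=${info.numRecords} " +
      s"schedulingDelayMs=${info.schedulingDelay.getOrElse(-1L)} " +
      s"processingDelayMs=${info.processingDelay.getOrElse(-1L)} " +
      s"totalDelayMs=${info.totalDelay.getOrElse(-1L)}")
  }
}
ssc.addStreamingListener(new BatchTimingListener)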
There is no difference in performance even with the consumer cache enabled.
On Mon, Aug 28, 2017 at 11:27 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> There is no difference in performance even with the consumer cache disabled.
>
> On Mon, Aug 28, 2017 at 7:43 AM, Co
Those settings seem reasonable enough. If preferred locations is
> behaving correctly you shouldn't need cached consumers for all 96
> partitions on any one executor, so that maxCapacity setting is
> probably unnecessary.
>
> On Fri, Aug 25, 2017 at 7:04 PM, swetha kasireddy
er.poll.ms" -> Integer.valueOf(1024),
http://markmail.org/message/n4cdxwurlhf44q5x
https://issues.apache.org/jira/browse/SPARK-19185
Also, I have a batch interval of 60 seconds. What do you suggest the following
be set to?
session.timeout.ms, heartbeat.interval.ms
On Fri, Aug 25, 2017 at
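For reference, a hedged sketch of where those two settings would go (values are
purely illustrative, not recommendations; heartbeat.interval.ms should stay well
below session.timeout.ms, and session.timeout.ms must fit within the broker's
group.max.session.timeout.ms):

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",            // hypothetical broker
  "group.id" -> "my-consumer-group",                // hypothetical group id
  "session.timeout.ms" -> Integer.valueOf(30000),   // illustrative value
  "heartbeat.interval.ms" -> Integer.valueOf(10000) // roughly 1/3 of session timeout
)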
Because I saw some posts saying that with the consumer cache enabled you can
hit a ConcurrentModificationException with reduceByKeyAndWindow. I see those
errors as well after running for some time with the cache enabled, so I
had to disable it. Please see the tickets below. We have 96 partitions. So
if
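For anyone hitting the same thing, a sketch of the two knobs being discussed
(spark-streaming-kafka-0-10; 96 matches the partition count above, values
otherwise illustrative):

val sparkConf = new org.apache.spark.SparkConf()
  // Workaround for the ConcurrentModificationException tickets above:
  // turn the cached Kafka consumers off entirely.
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")
  // Or, with the cache left on, bound its size instead.
  .set("spark.streaming.kafka.consumer.cache.maxCapacity", "96")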
Hi Cody,
I think Assign is used if we want it to start from a specified offset.
What if we want it to start from the latest offset, with something like
"auto.offset.reset" -> "latest"?
Thanks!
On Mon, Aug 21, 2017 at 9:06 AM, Cody Koeninger wrote:
>
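A sketch of both strategies side by side (reusing topicsSet and kafkaParams from
the earlier snippet, kafkaParams assumed to be an immutable Map[String, Object];
the explicit offset is hypothetical):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// Subscribe + auto.offset.reset=latest: a group with no committed offsets
// starts from the latest offset.
val latestParams = kafkaParams + ("auto.offset.reset" -> "latest")
val subscribe = ConsumerStrategies.Subscribe[String, String](topicsSet, latestParams)

// Assign: start each partition from an explicitly supplied offset instead.
val offsets = Map(new TopicPartition("test1", 0) -> 123L) // hypothetical offset
val assign = ConsumerStrategies.Assign[String, String](offsets.keys.toList, latestParams, offsets)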
Hi,
I am facing issues while trying to recover a textFileStream from a checkpoint.
Basically it is trying to load files from the beginning of the job,
whereas I am deleting the files after processing them. I have the following
configs set, so I was thinking that it should not look for files
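One avenue worth checking (a sketch, with hypothetical directory and filter):
textFileStream is a thin wrapper over fileStream, and fileStream exposes a
newFilesOnly flag plus a path filter directly, so old or already-deleted files
can be excluded explicitly.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///input/dir",                            // hypothetical directory
  (path: Path) => !path.getName.startsWith("."),  // skip hidden/partial files
  newFilesOnly = true                             // only consider newly arriving files
).map(_._2.toString)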
OK. Thanks TD. Does stateSnapshots() bring a snapshot of the state of
all the keys managed by mapWithState, or does it just bring the state of
the keys in the current micro batch? It's kind of conflicting, because the
following link says that it brings the state only for the keys seen in the
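For reference, a sketch of the two views mapWithState exposes (pairStream as a
DStream[(String, Int)] is an assumption). As far as the API reads,
stateSnapshots() covers all tracked keys, while the mapped stream itself only
carries keys seen in the current micro batch:

import org.apache.spark.streaming.{State, StateSpec}

def countUpdater(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
  val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}

val mapped = pairStream.mapWithState(StateSpec.function(countUpdater _))
val perBatch = mapped                  // keys seen in this micro batch only
val allKeys  = mapped.stateSnapshots() // snapshot of state for every tracked key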
Yes, the Spark UI has some information, but it's not that helpful for finding
out which particular stage is taking time.
On Wed, Jun 28, 2017 at 12:51 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:
> You can find the information from the spark UI
>
> ---Original---
> *From:* "SRK"
>
at 8:43 AM, swetha kasireddy <swethakasire...@gmail.com>
wrote:
> I changed the data structure to scala.collection.immutable.Set and I still
> see the same issue. My key is a String. I do the following in my reduce
> and invReduce.
>
> visitorSet1 ++visitorSet2.toTraversa
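A sketch of the immutable version (visitors as a
DStream[(String, immutable.Set[String])] is an assumption). One caveat worth
stating: set difference is only a valid inverse if a given visitor enters the
window in at most one batch; otherwise the invertible form undercounts.

import scala.collection.immutable
import org.apache.spark.streaming.Seconds

val reduceFn    = (s1: immutable.Set[String], s2: immutable.Set[String]) => s1 ++ s2
val invReduceFn = (s1: immutable.Set[String], s2: immutable.Set[String]) => s1 -- s2

// Both functions return new sets; nothing is mutated between window
// computations. Invertible windows also require checkpointing to be enabled.
val windowed = visitors.reduceByKeyAndWindow(reduceFn, invReduceFn, Seconds(600), Seconds(60))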
OK. Can we use Spark Kafka Direct with Structured Streaming?
On Thu, Jun 8, 2017 at 4:46 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:
> OK. Can we use Spark Kafka Direct as part of Structured Streaming?
>
> On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das <tathagata
OK. Can we use Spark Kafka Direct as part of Structured Streaming?
On Thu, Jun 8, 2017 at 3:35 PM, Tathagata Das
wrote:
> YES. At Databricks, our customers have already been using Structured
> Streaming and in the last month alone processed over 3 trillion records.
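A minimal sketch of the Structured Streaming Kafka source (the
spark-sql-kafka-0-10 artifact; spark is a SparkSession, and broker and topic
names are placeholders):

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // hypothetical broker
  .option("subscribe", "test1")
  .load()

// key/value arrive as binary; cast before processing.
val records = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")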
<tathagata.das1...@gmail.com>
wrote:
> Yes, and in general any mutable data structure. You have to use immutable
> data structures whose hashCode and equals are consistent enough for them to
> be put in a set.
>
> On Jun 6, 2017 4:50 PM, "swetha kasireddy" <swethakasire..
Are you suggesting against the usage of HashSet?
On Tue, Jun 6, 2017 at 3:36 PM, Tathagata Das
wrote:
> This may be because HashSet is a mutable data structure, and it seems
> you are actually mutating it in "set1 ++set2". I suggest creating a new
> HashMap in
Would even Hive configurations like the following work with this?
sqlContext.setConf("hive.default.fileformat", "Orc")
sqlContext.setConf("hive.exec.orc.memory.pool", "1.0")
sqlContext.setConf("hive.optimize.sort.dynamic.partition", "true")
Can this Docker image be used to spin up a Kafka cluster in a CI/CD pipeline
like Jenkins to run the integration tests, or can that be done only on a
local machine that has Docker installed? I assume that the box where the
CI/CD pipeline runs should have Docker installed, correct?
On Mon, Jul 4,
The application's output is that it inserts data into Cassandra at the end of
every batch.
On Mon, Jul 4, 2016 at 5:20 AM, Lars Albertsson wrote:
> I created such a setup for a client a few months ago. It is pretty
> straightforward, but it can take some work to get all the wires
sampleMap is populated from inside a method that is getting called from
updateStateByKey
On Thu, Jun 23, 2016 at 1:13 PM, Ted Yu wrote:
> Can you illustrate how sampleMap is populated ?
>
> Thanks
>
> On Thu, Jun 23, 2016 at 12:34 PM, SRK wrote:
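For context, a stripped-down sketch of the updateStateByKey shape being
discussed (the running count and events, a DStream[(String, Int)], are
assumptions; checkpointing must be enabled for any stateful DStream):

ssc.checkpoint("hdfs:///checkpoints/app") // hypothetical path; required for state

def updateCount(newValues: Seq[Int], state: Option[Long]): Option[Long] =
  Some(state.getOrElse(0L) + newValues.sum) // called once per key per batch

val counts = events.updateStateByKey(updateCount _)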
Hi Mich,
No, I have not tried that. My requirement is to insert the data from an hourly
Spark batch job. How is that different from inserting with the Hive CLI or
Beeline?
Thanks,
Swetha
On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh wrote:
> Hi Swetha,
>
> Have you
oner').orc("path/to/final/location")
>
> And ORC format is supported with HiveContext only.
>
> Thanks,
> Bijay
>
>
> On Mon, Jun 13, 2016 at 11:41 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Mich,
>>
>> Following
erRecord,
ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
""".stripMargin)
On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Hi Bijay,
>
> If I am hitting this issue,
> https://issues.apache.or
Hi Bijay,
If I am hitting this issue,
https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done? Is
upgrading to a higher version of Hive the only solution?
Thanks!
On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> Hi,
>
No, I am reading the data from HDFS, transforming it, registering it as a
temp table using registerTempTable, and then doing an insert overwrite using
Spark SQL's hiveContext.
On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh
wrote:
> how are you doing the insert?
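A sketch of the flow described above (read from HDFS, register a temp table,
insert overwrite through hiveContext); the partitioned records table matches
the DDL quoted later in this thread, but the paths and the query itself are
illustrative:

val df = hiveContext.read.parquet("hdfs:///staging/records") // hypothetical path
df.registerTempTable("staged_records")

// Dynamic-partition inserts typically need this relaxed first.
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hiveContext.sql(
  """INSERT OVERWRITE TABLE records PARTITION (datePartition, idPartition)
    |SELECT id, record, datePartition, idPartition FROM staged_records
  """.stripMargin)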
400 cores are assigned to this job.
On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch wrote:
> How many workers (/cpu cores) are assigned to this job?
>
> 2016-06-09 13:01 GMT-07:00 SRK :
>
>> Hi,
>>
>> How to insert data into 2000
> On 22 May 2016 at 20:38, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
>> Around 14000 partitions need to be load
> On 22 May 2016 at 20:24, swetha
The data is not very big, say 1 MB-10 MB at most per partition. What is
the best way to insert these 14k partitions with decent performance?
On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh wrote:
> the acid question is how many rows are you going to insert in a
> On 22 May 2016 at 19:43, swetha k
> On 22 May 2016 at 19:11, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
>> I am looking at ORC. I insert the data using the following query.
>>
>
I am looking at ORC. I insert the data using the following query.
sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING,
record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING)
stored as ORC LOCATION '/user/users' ")
sqlContext.sql(" orc.compress=
Also, the Spark SQL insert seems to take only two tasks per stage. That
might be the reason why it does not have sufficient memory. Is there a way
to increase the number of tasks when doing the sql insert?
[Spark UI stage table snipped: Stage Id, Description, Submitted, Duration,
Tasks: Succeeded/Total, Input, Output, Shuffle Read, Shuffle Write]
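Since the task count for an insert follows the partitioning of the data being
written, one hedged workaround is to repartition before the insert (df standing
for the DataFrame being inserted, and 200, are assumptions):

val wide = df.repartition(200) // more partitions -> more parallel insert tasks
wide.registerTempTable("staged_records")
// For plans that shuffle, this setting also caps the parallelism.
hiveContext.setConf("spark.sql.shuffle.partitions", "200")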
Hi Lars,
Do you have any examples for the methods that you described for Spark batch
and Streaming?
Thanks!
On Wed, Mar 30, 2016 at 2:41 AM, Lars Albertsson wrote:
> Thanks!
>
> It is on my backlog to write a couple of blog posts on the topic, and
> eventually some example
irely up to Kafka, which
> is why I'm saying if you're losing leaders, you should look at Kafka.
>
> On Fri, Apr 29, 2016 at 11:21 AM, swetha kasireddy
> <swethakasire...@gmail.com> wrote:
> > OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a
> > defa
refresh.leader.backoff.ms and then retry
again depending on the number of retries?
On Fri, Apr 29, 2016 at 8:14 AM, swetha kasireddy <swethakasire...@gmail.com
> wrote:
> OK. So the Kafka side they use rebalance.backoff.ms of 2000 which is a
> default for rebalancing
OK. So on the Kafka side they use a rebalance.backoff.ms of 2000, which is the
default for rebalancing, and they say that a refresh.leader.backoff.ms of 200
for refreshing the leader is very aggressive, and they suggested we increase
it to 2000. Even after increasing it to 2500 I still get leader-lost errors.
Is
OK. I did take a look at them. So once I have an accumulator for a HashSet,
how can I check whether a particular key is already present in the HashSet
accumulator? I don't see any .contains method there. My requirement is that
I need to keep accumulating the keys in the HashSet across all the tasks in
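There is no contains on the stock accumulators; below is a sketch of a custom
one (Spark 2.x AccumulatorV2 is an assumption here). The bigger caveat:
executors can only add to an accumulator, and the merged value is only readable
on the driver after an action, so tasks themselves cannot do membership checks
against it.

import scala.collection.mutable
import org.apache.spark.util.AccumulatorV2

class SetAccumulator extends AccumulatorV2[String, Set[String]] {
  private val underlying = mutable.Set.empty[String]
  override def isZero: Boolean = underlying.isEmpty
  override def copy(): SetAccumulator = {
    val acc = new SetAccumulator
    acc.underlying ++= underlying
    acc
  }
  override def reset(): Unit = underlying.clear()
  override def add(v: String): Unit = underlying += v
  override def merge(other: AccumulatorV2[String, Set[String]]): Unit =
    underlying ++= other.value
  override def value: Set[String] = underlying.toSet
}

val keys = new SetAccumulator
sc.register(keys, "seenKeys")
// ... tasks call keys.add(key); on the driver, after an action:
val seen = keys.value.contains("someKey") // hypothetical key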
Thanks. I tried this yesterday and it seems to be working.
On Wed, Mar 2, 2016 at 1:49 AM, James Hammerton wrote:
> Hi,
>
> Based on the behaviour I've seen using parquet, the number of partitions
> in the DataFrame will determine the number of files in each parquet
> partition.
It seems to be failing when I do something like the following in both
sqlContext and hiveContext:
sqlContext.sql("SELECT ssd.savedDate from saveSessionDatesRecs ssd
where ssd.partitioner in (SELECT sr1.partitioner from
sparkSessionRecords1 sr1)")
On Tue, Feb 23, 2016 at 5:57 PM, swetha
These tables are stored in HDFS as parquet. Can sqlContext be used for
the subqueries?
On Tue, Feb 23, 2016 at 5:31 PM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:
> Assuming these are all in Hive, you can either use spark-sql or
> spark-shell.
>
> HiveContext has
of
a number of small files and also to be able to scan faster.
Something like ...df.write.format("parquet").partitionBy("userIdHash",
"userId").mode(SaveMode.Append).save("userRecords")
On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <swethakasire
So suppose I have a bunch of userIds and I need to save them as parquet in
the database. I also need to load them back and be able to do a join
on userId. My idea is to partition by the userId hashcode first and then by
userId.
On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust
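A sketch of that two-level layout (userDf, the bucket count of 512, and the
column names are assumptions):

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.{col, udf}

// Bucket by a hash of userId first so no single directory level explodes,
// then by userId itself.
val hashBucket = udf((userId: String) => math.abs(userId.hashCode % 512))
val withHash = userDf.withColumn("userIdHash", hashBucket(col("userId")))

withHash.write
  .format("parquet")
  .partitionBy("userIdHash", "userId")
  .mode(SaveMode.Append)
  .save("userRecords")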
How can a custom partitioner hashed by userId be used inside saveAsTable with
a dataframe?
On Mon, Feb 15, 2016 at 11:24 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:
> How about saving the dataframe as a table partitioned by userId? My User
> records have userId, number of ses
ENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 10
>
> [1998-01,Direct Sales,1823005210]
> [1998-01,Internet,248172522]
> [1998-01,Partners,474646900]
> [1998-02,Direct Sales,1819659036]
> [1998-02,Internet,298586496]
> [1998
OK. Would it only query for the records that I want in Hive as per the filter,
or just load the entire table? My user table will have millions of records,
and I do not want to cause OOM errors by loading the entire table into memory.
On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh
Following is the error that I see when it retries.
org.apache.spark.SparkException: Failed to read checkpoint from directory
/share/checkpointDir
at org.apache.spark.streaming.CheckpointReader$.read(Checkpoint.scala:342)
at
Hi,
I want to edit/delete a message posted in Spark User List. How do I do that?
Thanks!
OK. What should the table be? Suppose I have a bunch of parquet files, do I
just specify the directory as the table?
On Fri, Jan 1, 2016 at 11:32 PM, UMESH CHAUDHARY
wrote:
> Ok, so what's wrong with using:
>
> var df=HiveContext.sql("Select * from table where id = ")
>
I see this happening when there is a deadlock situation. The RDD test1 has a
Couchbase call and seems to have threads hanging there. Even though
all the connections are closed, I see the threads related to Couchbase
causing the job to hang for some time before it gets cleared up.
Would the
Hi,
How to verify whether the GangliaSink directory got created?
Thanks,
Swetha
On Mon, Dec 15, 2014 at 11:29 AM, danilopds wrote:
> Thanks tsingfu,
>
> I used this configuration based on your post: (with ganglia unicast mode)
> # Enable GangliaSink for all instances
>
OK. I think the following can be used.
mvn -Pspark-ganglia-lgpl -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-DskipTests clean package
On Mon, Dec 7, 2015 at 10:13 AM, SRK wrote:
> Hi,
>
> How to do a maven build to enable monitoring using Ganglia? What is the
>
Any documentation/sample code on how to use Ganglia with Spark?
On Sat, Dec 5, 2015 at 10:29 PM, manasdebashiskar
wrote:
> spark has capability to report to ganglia, graphite or jmx.
> If none of that works for you, you can register your own spark extra
> listener
> that
Hi Cody,
How should I look at Option 2 (see the following)? Which portion of the
Spark Kafka Direct code should I look at to handle this issue specific to our
requirements?
2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated
at 3:40 PM, Cody Koeninger <c...@koeninger.org> wrote:
> KafkaRDD.scala , handleFetchErr
>
> On Tue, Dec 1, 2015 at 3:39 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
>>
>> How to look at Option 2(see the following)? Which
Following is the Option 2 that I was talking about:
2. Catch that exception and somehow force things to "reset" for that
partition. And how would it handle the offsets already calculated in the
backlog (if there is one)?
On Tue, Dec 1, 2015 at 1:39 PM, swetha kasireddy <swethakasire
s? Can
>> someone manually delete folders from the checkpoint folder to help the job
>> recover? E.g. Go 2 steps back, hoping that kafka has those offsets.
>>
>> -adrian
>>
>> From: swetha kasireddy
>> Date: Monday, November 9, 2015 at 10:40 PM
>> To: Cody Koeni
your situation.
>
> The KafkaRDD will use the value of refresh.leader.backoff.ms, so you can
> try adjusting that to get a longer sleep before retrying the task.
>
> On Mon, Nov 30, 2015 at 1:50 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
code.
>
>
>
>
> On Wed, Nov 25, 2015 at 12:57 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> I am killing my streaming job using the UI. What error code does the UI
>> provide if the job is killed from there?
>>
>> On Wed, Nov 25, 2015 at 11:
I am killing my streaming job using the UI. What error code does the UI
provide if the job is killed from there?
On Wed, Nov 25, 2015 at 11:01 AM, Kay-Uwe Moosheimer
wrote:
> Tested with Spark 1.5.2 … Works perfectly when the exit code is non-zero.
> And does not restart with exit code
:31 AM, Cody Koeninger <c...@koeninger.org> wrote:
> No, the direct stream only communicates with Kafka brokers, not Zookeeper
> directly. It asks the leader for each topicpartition what the highest
> available offsets are, using the Kafka offset api.
>
> On Mon, Nov 23, 201
Are the partitions it's failing for all on the same leader?
> Have there been any leader rebalances?
> Do you have enough log retention?
> If you log the offset for each message as it's processed, when do you see
> the problem?
>
> On Tue, Nov 24, 2015 at 10:28 AM, swetha kasired
stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
  ...
}
On Mon, Nov 23, 2015 at 6:31 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:
> Also, does Kafka direct query the offsets from the zookeeper directly?
> From where does it get the offset
Does Kafka Direct query the offsets from ZooKeeper directly? From where
does it get the offsets? There is data at those offsets, but somehow Kafka
Direct does not seem to pick it up. Other consumers that use the ZooKeeper
quorum of that stream seem to be fine. Only Kafka Direct seems to have
I mean showing the Spark Kafka Direct consumers in the Kafka stream UI.
Usually we create a consumer and the consumer gets shown in the Kafka stream
UI. How do I log the offsets in the Spark job?
On Mon, Nov 23, 2015 at 6:11 PM, Cody Koeninger wrote:
> What exactly do you mean
Also, does Kafka Direct query the offsets from ZooKeeper directly? From where
does it get the offsets? There is data at those offsets, but somehow
Kafka Direct does not seem to pick it up.
On Mon, Nov 23, 2015 at 6:18 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:
> I mea
That was actually an issue with our Mesos.
On Wed, Nov 18, 2015 at 5:29 PM, Tathagata Das <t...@databricks.com> wrote:
> If possible, could you give us the root cause and solution for future
> readers of this thread.
>
> On Wed, Nov 18, 2015 at 6:37 AM, swetha kasiredd
OK. We have a long-running streaming job. I was thinking that maybe we
should have a cron to clear files that are older than 2 days. What would be
an appropriate way to do that?
On Wed, Nov 18, 2015 at 7:43 PM, Ted Yu wrote:
> Have you seen SPARK-5836 ?
> Note TD's comment
Looks like I can use mapPartitions, but can it be done using
foreachPartition?
On Tue, Nov 17, 2015 at 11:51 PM, swetha wrote:
> Hi,
>
> How to return an RDD of key/value pairs from an RDD that has
> foreachPartition applied. I have my code something like the
It works fine after some changes.
-Thanks,
Swetha
On Tue, Nov 17, 2015 at 10:22 PM, Tathagata Das <t...@databricks.com> wrote:
> Can you verify that the cluster is running the correct version of Spark,
> 1.5.2?
>
> On Tue, Nov 17, 2015 at 7:23 PM, swetha kasireddy <
> s
I see this error locally.
On Tue, Nov 17, 2015 at 5:44 PM, Tathagata Das wrote:
> Are you running 1.5.2-compiled jar on a Spark 1.5.2 cluster?
>
> On Tue, Nov 17, 2015 at 5:34 PM, swetha wrote:
>
>>
>>
>> Hi,
>>
>> I see
I am using the following:
<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>parquet-avro</artifactId>
  <version>1.6.0</version>
</dependency>
On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu
wrote:
> Which Spark version is used?
>
> It was fixed in Parquet 1.7.x, so Spark 1.5.x will work.
>
>
>
>
> > On Nov 9, 2015, at 3:43 PM, swetha
heckpoint directory is a good way to restart
> the streaming job, you should stop the spark context or at the very least
> kill the driver process, then restart.
>
> On Mon, Nov 9, 2015 at 2:03 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi Cody,
the
>> checkpoint directory in our hdfs.
>>
>> On Mon, Nov 9, 2015 at 12:34 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> I don't think deleting the checkpoint directory is a good way to restart
>>> the streaming job, you should stop
I think they are roughly of equal size.
On Fri, Nov 6, 2015 at 3:45 PM, Ted Yu wrote:
> Can you tell us a bit more about your use case ?
>
> Are the two RDDs expected to be of roughly equal size or, to be of vastly
> different sizes ?
>
> Thanks
>
> On Fri, Nov 6, 2015 at
Hi Ankur,
I have the following questions on IndexedRDD.
1. Does IndexedRDD support keys of type String? As per the current
documentation, it looks like it supports only Long.
2. Is IndexedRDD efficient when joined with another RDD? So basically my
use case is that I need to create an
you call this unpersist, how to call this unpersist (from
> another thread)?
>
> Thanks
> Saisai
>
>
> On Thu, Nov 5, 2015 at 3:13 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Other than setting the following.
>>
>> sparkConf.set("spark.
I am not looking for Spark SQL specifically. My use case is that I need to
save an RDD as a parquet file in HDFS at the end of a batch and load it
back and convert it into an RDD in the next batch. The RDD has a String and
a Long as the key/value pairs.
On Wed, Nov 4, 2015 at 11:52 PM, Stefano
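A sketch of that round trip with the Spark 1.x SQLContext API (paths and
column names are hypothetical; rdd is the RDD[(String, Long)] from the message
above):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// End of batch N: persist the RDD[(String, Long)] as parquet.
rdd.toDF("key", "value").write.parquet("hdfs:///state/batch-n") // hypothetical path

// Start of batch N+1: read it back and drop down to an RDD again.
val restored = sqlContext.read.parquet("hdfs:///state/batch-n")
  .map(row => (row.getString(0), row.getLong(1)))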
How do I convert a parquet file that is saved in HDFS to an RDD after reading
the file from HDFS?
On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman wrote:
> Hi,
> we are using avro with compression(snappy). As soon as you have enough
> partitions, the saving won't be a problem
t[T]])
  sc.newAPIHadoopFile(
    parquetFile,
    classOf[ParquetInputFormat[T]],
    classOf[Void],
    tag.runtimeClass.asInstanceOf[Class[T]],
    jobConf)
    .map(_._2.asInstanceOf[T])
}
On Thu, Nov 5, 2015 at 2:14 PM, swetha kasireddy <swethakasire...@gmail.com>
wrote:
> No scala. Sup
ng from java - toJavaRDD
> <https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/DataFrame.html#toJavaRDD()>*
> ()
>
> On 5 November 2015 at 21:13, swetha kasireddy <swethakasire...@gmail.com>
> wrote:
>
>> How to convert a parquet file that is saved
ab/spark-indexedrdd
>
> On 2 November 2015 at 23:29, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Please take a look at SPARK-2365
>>
>> On Mon, Nov 2, 2015 at 3:25 PM, swetha ka
Other than setting the following.
sparkConf.set("spark.streaming.unpersist", "true")
sparkConf.set("spark.cleaner.ttl", "7200s")
On Wed, Nov 4, 2015 at 5:03 PM, swetha wrote:
> Hi,
>
> How to unpersist a DStream in Spark Streaming? I know that we can persist
> using
Hi,
Is Indexed RDDs released yet?
Thanks,
Swetha
On Sun, Nov 1, 2015 at 1:21 AM, Gylfi wrote:
> Hi.
>
> You may want to look into Indexed RDDs
> https://github.com/amplab/spark-indexedrdd
>
> Regards,
> Gylfi.
>
>
>
>
>
>
> --
> View this message in context:
>
So, wouldn't using a custom partitioner on the RDD on which the
groupByKey or reduceByKey is performed avoid shuffles and improve
performance? My code does groupByAndSort and reduceByKey on different
datasets as shown below. Would using a custom partitioner on those datasets
before using a
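A sketch of the idea (dataset as an RDD[(String, Int)] and the partition count
of 96 are assumptions): pre-partition once, persist, and hand the same
partitioner to the wide operations so they reuse the layout instead of
reshuffling.

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(96)
val keyed = dataset.partitionBy(partitioner).persist()

// Because keyed is already hash-partitioned, neither of these reshuffles.
val reduced = keyed.reduceByKey(partitioner, _ + _)
val grouped = keyed.groupByKey(partitioner)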
unt memory allocated for shuffles by changing
> the configuration spark.shuffle.memoryFraction. A larger fraction would
> cause less spilling.
>
>
> On Tue, Oct 27, 2015 at 6:35 PM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> So, Wouldn't
ow it could affect performance.
>
> Used correctly it should improve performance as you can better control
> placement of data and avoid shuffling…
>
> -adrian
>
> From: swetha kasireddy
> Date: Monday, October 26, 2015 at 6:56 AM
> To: Adrian Tanase
> Cc: Bill Bejeck, "us
Hi,
Does the use of a custom partitioner in Streaming affect performance?
On Mon, Oct 5, 2015 at 1:06 PM, Adrian Tanase wrote:
> Great article, especially the use of a custom partitioner.
>
> Also, sorting by multiple fields by creating a tuple out of them is an
> awesome,
Hi Cody,
What other options do I have other than monitoring and restarting the job?
Can the job recover automatically?
Thanks,
Sweth
On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger wrote:
> Did you check your Kafka broker logs to see what was going on during that
> time?
>
>
What about cleaning up the temp data that gets generated by shuffles? We
have a lot of temp data generated by shuffles in the /tmp folder.
That's why we are using the TTL. Also, if I keep an RDD in cache, is it
available across all the executors or just the same executor?
On Fri, Oct 16, 2015 at
We have limited disk space. So, can we use spark.cleaner.ttl to clean up
the files? Or is there any setting that can clean up old temp files?
On Mon, Sep 28, 2015 at 7:02 PM, Shixiong Zhu wrote:
> These files are created by shuffle and just some temp files. They are not
>