Thanks everyone. As Nathan suggested, I ended up collecting the distinct
keys first and then assigning IDs to each key explicitly.
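A minimal sketch of that approach (the function name and usage are illustrative, not from the thread): collect the distinct keys, give each one its own partition ID, and use that mapping as a custom partitioner.

```python
# Sketch of "collect distinct keys, assign IDs explicitly".
# Names here are illustrative, not from the original thread.

def build_key_to_partition(keys):
    """Map each distinct key to its own partition id."""
    return {key: i for i, key in enumerate(sorted(keys))}

# With a real RDD this would be used roughly as (untested PySpark sketch):
#   keys = rdd.keys().distinct().collect()
#   mapping = build_key_to_partition(keys)
#   partitioned = rdd.partitionBy(len(mapping), lambda k: mapping[k])

mapping = build_key_to_partition(['a', 'b', 'c', 'd'])
print(mapping)  # every key gets a distinct partition id
```

Because the partition function is a plain dict lookup, no two keys can share a partition, which is exactly the guarantee hash partitioning cannot give.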
Regards
Sumit Chawla
On Fri, Jun 22, 2018 at 7:29 AM, Nathan Kronenfeld <
nkronenfeld@uncharted.software> wrote:
Based on reading the code, it looks like Spark does a modulo of the key's
hash to pick a partition. Keys 'c' and 'b' end up pointing to the same
partition. What's the best partitioning scheme to deal with this?
Regards
Sumit Chawla
On Thu, Jun 21, 2018 at 4:51 PM, Chawla,Sumit
wrote:
Hi
I have been trying to do this simple operation. I want to land all values
with one key in the same partition, and not have any different key in the
same partition. Is this possible? 'b' and 'c' always end up mixed
together in the same partition.
rdd = sc.parallelize([('a', 5), ('d', 8),
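The mixing of 'b' and 'c' comes from hash-modulo partitioning. A toy, deterministic stand-in for PySpark's hash makes the effect reproducible without a cluster (in this toy version 'b' and 'd' happen to collide; which keys collide depends on the hash function and the partition count):

```python
# Toy illustration of hash-modulo partitioning. toy_hash is invented for
# determinism; PySpark actually uses its own portable hash of the key.

def toy_hash(key):
    return sum(ord(c) for c in key)

def partition_for(key, num_partitions):
    return toy_hash(key) % num_partitions

# With 2 partitions, 'b' (98) and 'd' (100) land in the same even bucket:
for k in ['a', 'b', 'c', 'd']:
    print(k, partition_for(k, 2))
```

With any modulo scheme, distinct keys can share a partition; only an explicit key-to-partition mapping guarantees one key per partition.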
Hi
Anybody got any pointers on this one?
Regards
Sumit Chawla
On Tue, Mar 6, 2018 at 8:58 AM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> No, this is the only stack trace I get. I have tried DEBUG but didn't
> notice much of a log change.
>
> Yes, I
this number to appropriate
value.
Regards
Sumit Chawla
On Tue, Mar 6, 2018 at 8:07 AM, Vadim Semenov <va...@datadoghq.com> wrote:
> Do you have a trace? i.e. what's the source of `io.netty.*` calls?
>
> And have you tried bumping `-XX:MaxDirectMemorySize`?
>
> On Tue, Mar 6, 2018 at 1
Hi All
I have a job which processes a large dataset. All items in the dataset are
unrelated. To save on cluster resources, I process these items in
chunks. Since chunks are independent of each other, I start and shut down
the Spark context for each chunk. This allows me to keep the DAG smaller
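A rough sketch of that per-chunk pattern (the chunk size and `process_chunk` are assumptions; the SparkContext lines are commented out so the shape is visible without a cluster):

```python
def split_into_chunks(items, chunk_size):
    """Split independent work items into fixed-size chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# One fresh context per chunk keeps each DAG small (untested PySpark sketch):
#   from pyspark import SparkContext
#   for chunk in split_into_chunks(all_items, 1000):
#       sc = SparkContext(appName="chunk-job")
#       sc.parallelize(chunk).foreach(process_chunk)
#       sc.stop()

print(split_into_chunks(list(range(5)), 2))  # [[0, 1], [2, 3], [4]]
```

The trade-off is context startup cost per chunk versus a single ever-growing lineage.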
out. I’m not
>> sure if that does exactly the same thing. The default for that setting is
>> 1h instead of 0. It’s better to have a non-zero default to avoid what
>> you’re seeing.
>>
>> rb
>>
>>
>> On Fri, Apr 21, 2017 at 1:32 PM, Chawla,Sumit <
I am seeing a strange issue. I had a badly behaving slave that failed the
entire job. I have set spark.task.maxFailures to 8 for my job. It seems like
all task retries happen on the same slave in case of failure. My
expectation was that the task would be retried on a different slave in case
of failure, and
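In later Spark versions (2.1+), the blacklisting settings below address exactly this behavior; treat the exact names and defaults as approximate if you are on an older release:

```properties
# Enable task/executor blacklisting so retries move off a bad node
spark.blacklist.enabled=true
# Retry a task at most this many times per executor / per node
# before the scheduler is forced to try elsewhere
spark.blacklist.task.maxTaskAttemptsPerExecutor=1
spark.blacklist.task.maxTaskAttemptsPerNode=2
```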
Hi All
I have an RDD, which I partition based on some key, and then call sc.runJob
for each partition.
Inside this function, I assign each partition a unique key using the following:
"%s_%s" % (id(part), int(round(time.time())))
This is to make sure that each partition produces separate bookkeeping
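A more stable alternative to `id(part)` plus wall-clock time is the partition index that `mapPartitionsWithIndex` already provides, combined with a single per-run ID (`run_id` and `do_bookkeeping` here are assumptions for illustration):

```python
import uuid

def make_partition_key(partition_index, run_id):
    """Unique bookkeeping key: stable partition index plus a per-run id."""
    return "%s_%s" % (partition_index, run_id)

# Untested PySpark sketch of how this would plug in:
#   run_id = uuid.uuid4().hex
#   rdd.mapPartitionsWithIndex(
#       lambda idx, it: do_bookkeeping(make_partition_key(idx, run_id), it))

run_id = uuid.uuid4().hex
print(make_partition_key(3, run_id))
```

Unlike `id(part)`, the partition index is stable across retries of the same partition, so bookkeeping from a retried task lands under the same key.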
Would this work for you?
def processRDD(rdd):
    analyzer = ShortTextAnalyzer(root_dir)
    rdd.foreach(lambda record: analyzer.analyze_short_text_event(record[1]))

ssc.union(*streams).filter(lambda x: x[1] is not None) \
    .foreachRDD(lambda rdd: processRDD(rdd))
Regards
Sumit Chawla
On Wed, Dec
>> Since the direct result (up to 1M by default) will also go through
>> mesos, it's better to tune it lower, otherwise mesos could become the
>> bottleneck.
>>
>> spark.task.maxDirectResultSize
>>
>> On Mon, Dec 19, 2016 at 3:23 PM, Chawla,Sumit <sumitkcha
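As a spark-defaults.conf fragment (the value shown is illustrative, not a recommendation from the thread):

```properties
# Task results larger than this are fetched via the block manager instead
# of being sent back inline through the scheduler path; lowering it keeps
# big results from flowing through Mesos (131072 = 128 KB, illustrative).
spark.task.maxDirectResultSize=131072
```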
> Objet: Re: Mesos Spark Fine Grained Execution - CPU count
> >
> >
> >> Is this problem of idle executors sticking around solved in Dynamic
> >> Resource Allocation? Is there some timeout after which idle executors
> can
> >> just shut down and clean
ub.com/apache/spark/blob/v1.6.3/docs/running-on-mesos.md)
> and spark.task.cpus (https://github.com/apache/spark/blob/v1.6.3/docs/configuration.md)
>
> On Mon, Dec 19, 2016 at 12:09 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Ah thanks. looks like i skipp
Mesosphere
>
> On Mon, Dec 19, 2016 at 11:26 AM, Timothy Chen <tnac...@gmail.com> wrote:
>
>> Hi Chawla,
>>
>> One possible reason is that Mesos fine grain mode also takes up cores
>> to run the executor per host, so if you have 20 agents running Fine
>&
ating the
> fine-grained scheduler, and no one seemed too dead-set on keeping it. I'd
> recommend you move over to coarse-grained.
>
> On Fri, Dec 16, 2016 at 8:41 AM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Hi
>>
>> I am using Spark 1.6. I hav
Hi
I am using Spark 1.6. I have one query about the fine-grained model in Spark.
I have a simple Spark application which transforms A -> B. It's a single-stage
application. To begin, the program starts with 48 partitions.
When the program starts running, the Mesos UI shows 48 tasks and 48 CPUs
sorry for hijacking this thread.
@irving, how do you restart a Spark job from a checkpoint?
Regards
Sumit Chawla
On Fri, Dec 16, 2016 at 2:24 AM, Selvam Raman wrote:
> Hi,
>
> Actually my requirement is to read the parquet file, which has 100 partitions.
> Then I use
ns functions. It is just a function that you can run
> arbitrary code after all.
>
>
> On Thu, Dec 15, 2016 at 11:33 AM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Any suggestions on this one?
>>
>> Regards
>> Sumit Chawla
>>
>>
>> O
Any suggestions on this one?
Regards
Sumit Chawla
On Tue, Dec 13, 2016 at 8:31 AM, Chawla,Sumit <sumitkcha...@gmail.com>
wrote:
Hi All
I have a workflow with different steps in my program. Let's say these are
steps A, B, C, D. Step B produces some temp files on each executor node.
How can I add another step E which consumes these files?
I understand the easiest choice is to copy all these temp files to any
shared
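A minimal local simulation of that "shared location" idea (the directory layout and file names are invented): step B writes per-partition temp files under a shared root, and step E globs them back.

```python
import glob
import os
import tempfile

# Simulate step B: each "executor" writes a temp file under a shared root
# (in a real cluster this root would be NFS/HDFS/S3, not a local tempdir).
shared_root = tempfile.mkdtemp(prefix="stepB_")
for partition in range(3):
    path = os.path.join(shared_root, "part_%d.txt" % partition)
    with open(path, "w") as f:
        f.write("data from partition %d\n" % partition)

# Simulate step E: discover and consume everything step B produced.
produced = sorted(glob.glob(os.path.join(shared_root, "part_*.txt")))
print(len(produced))  # 3
```

The key point is that step E only needs a naming convention and a shared root; it does not need to know which executor wrote which file.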
//github.com/hammerlab/spark-json-relay if it serves
> your need.
>
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Wed, Dec 7, 2016 at 1:10 AM, Chawla,Sumit <sumitkcha...@gmail.com>
Any pointers on this?
Regards
Sumit Chawla
On Mon, Dec 5, 2016 at 8:30 PM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> An example implementation I found is: https://github.com/groupon/spark-metrics
>
> Does anyone have experience using this? I am more inter
to spend some time reading it, but any quick pointers will be
appreciated.
Regards
Sumit Chawla
On Mon, Dec 5, 2016 at 8:17 PM, Chawla,Sumit <sumitkcha...@gmail.com> wrote:
> Hi Manish
>
> I am specifically looking for something similar to following:
>
> https://ci.apache.org
nodes.
>
> If you are using Mesos as Resource Manager , mesos exposes metrics as well
> for the running job.
>
> Manish
>
> ~Manish
>
>
>
> On Mon, Dec 5, 2016 at 4:17 PM, Chawla,Sumit <sumitkcha...@gmail.com>
> wrote:
>
>> Hi All
>>
>>
Hi All
I have a long-running job which takes hours and hours to process data. How
can I monitor the operational efficiency of this job? I am interested in
something like Storm/Flink-style user metrics/aggregators, which I can
monitor while my job is running. Using these metrics I want to
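Spark's closest built-in analogue to the Storm/Flink-style user metrics asked about above is the accumulator (`sc.accumulator` in PySpark). The add-only semantics can be illustrated without a cluster; the class below is a stand-in for illustration, not Spark's API:

```python
class AddOnlyCounter:
    """Stand-in for a Spark accumulator: tasks add, only the driver reads."""

    def __init__(self, initial=0):
        self.value = initial

    def add(self, amount):
        self.value += amount

# Untested PySpark sketch of the real thing:
#   processed = sc.accumulator(0)
#   rdd.foreach(lambda rec: processed.add(1))
#   print(processed.value)  # read from the driver while the job runs

counter = AddOnlyCounter()
for record in range(10):
    counter.add(1)
print(counter.value)  # 10
```

One caveat worth knowing: accumulator updates from a retried task can be applied more than once inside transformations, so they are reliable counters only when updated in actions.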
Hi Sean
Could you please elaborate on how can this be done on a per partition
basis?
Regards
Sumit Chawla
On Thu, Oct 27, 2016 at 7:44 AM, Walter rakoff
wrote:
> Thanks for the info Sean.
>
> I'm initializing them in a singleton but Scala objects are evaluated
>
t -
>>
>> I could see that the SparkConf() specification is not mentioned in your
>> program. But the rest looks good.
>>
>>
>>
>> Output:
>>
>>
>> By the way, I have used the README.md template
>> https://gist.github.com/jxson/1784669
>>
>
+1 Jakob. Thanks for the link
Regards
Sumit Chawla
On Wed, Sep 21, 2016 at 2:54 PM, Jakob Odersky wrote:
> One option would be to use Apache Toree. A quick setup guide can be
> found here https://toree.incubator.apache.org/documentation/user/
> quick-start
>
> On Wed, Sep
Hi All
I am trying to test a simple Spark app using Scala.
import org.apache.spark.SparkContext

object SparkDemo {
  def main(args: Array[String]) {
    val logFile = "README.md" // Should be some file on your system
    // to run in local mode
    val sc = new SparkContext("local", "Simple App")
    println(sc.textFile(logFile).count())
    sc.stop()
  }
}
How are you producing data? I just tested your code and I can receive the
messages from Kafka.
Regards
Sumit Chawla
On Sun, Sep 18, 2016 at 7:56 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:
> I am very new to *Spark streaming* and I am implementing a small exercise
> like sending