; process or the store).
> >
> > Can you check if the object store evicting objects (it prints something
> to
> > stdout/stderr when this happens)? Could you be running out of memory but
> > failing to release the objects?
> >
> > On Tue, Jul 10, 2018 at 9
kernel.
On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet wrote:
> Wes,
>
> Unfortunately, my code is on a separate network. I'll try to explain what
> I'm doing and if you need further detail, I can certainly pseudocode
> specifics.
>
> I am using multiprocessing.Pool() to fire
ain!
On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney wrote:
> hi Corey,
>
> Can you provide the code (or a simplified version thereof) that shows
> how you're using Plasma?
>
> - Wes
>
> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet wrote:
> > I'm on a system with 12T
I'm on a system with 12TB of memory and attempting to use Pyarrow's Plasma
client to convert a series of CSV files (via Pandas) into a Parquet store.
I've got a little over 20k CSV files to process which are about 1-2gb each.
I'm loading 500 to 1000 files at a time.
In each iteration, I'm
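A minimal sketch of the kind of batch loop described above (file paths, pool size, and output naming are made up for illustration; the original code wasn't shared):

import multiprocessing as mp
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def load_csv(path):
    # Each worker reads one 1-2GB CSV into a pandas DataFrame.
    return pd.read_csv(path)

def process_batch(paths, out_path):
    # Load a batch of 500-1000 files in parallel, then write a single Parquet file.
    with mp.Pool(processes=16) as pool:
        frames = pool.map(load_csv, paths)
    table = pa.Table.from_pandas(pd.concat(frames, ignore_index=True))
    pq.write_table(table, out_path)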
g-
> apache-spark-on-a-single-node-machine.html
>
> regards,
>
> 2018-05-23 22:30 GMT+02:00 Corey Nolet <cjno...@gmail.com>:
>
>> Please forgive me if this question has been asked already.
>>
>> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'
Please forgive me if this question has been asked already.
I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if
anyone knows of any efforts to implement the PySpark API on top of Apache
Arrow directly. In my case, I'm doing data science on a machine with 288
cores and 1TB of
a Plasma client object successfully (it has
> a socket connection to the store).
>
> On Wed, May 16, 2018 at 3:43 PM Corey Nolet <cjno...@gmail.com> wrote:
>
>> Robert,
>>
>> Thank you for the quick response. I've been playing around for a few hours
>> to ge
as a replacement for
> Python multiprocessing that automatically uses shared memory and Arrow for
> serialization.
>
> On Wed, May 16, 2018 at 10:02 AM Corey Nolet <cjno...@gmail.com> wrote:
>
> > I've been reading through the PyArrow documentation and trying to
> > und
I've been reading through the PyArrow documentation and trying to
understand how to use the tool effectively for IPC (using zero-copy).
I'm on a system with 586 cores & 1TB of RAM. I'm using pandas DataFrames
to process several tens of gigabytes of data in memory, and the pickling that is
done by
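For reference, a minimal sketch of zero-copy hand-off through Plasma (it assumes a plasma_store daemon is already running on /tmp/plasma, and the exact connect() arguments vary a bit across pyarrow versions; names are illustrative):

import pyarrow as pa
import pyarrow.plasma as plasma

# Producer process: place a DataFrame into shared memory as an Arrow RecordBatch.
client = plasma.connect("/tmp/plasma")
batch = pa.RecordBatch.from_pandas(df)   # df is an existing pandas DataFrame
object_id = client.put(batch)            # ObjectID to hand to consumer processes

# Consumer process: map the same object from shared memory without copying buffers.
client2 = plasma.connect("/tmp/plasma")
shared_batch = client2.get(object_id)
df_view = shared_batch.to_pandas()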
the
other user gets worked into the model.
On Mon, Nov 27, 2017 at 3:08 PM, Corey Nolet <cjno...@gmail.com> wrote:
> I'm trying to use the MatrixFactorizationModel to, for instance, determine
> the latent factors of a user or item that were not used in the training
> data of
I'm trying to use the MatrixFactorizationModel to, for instance, determine
the latent factors of a user or item that were not used in the training
data of the model. I'm not as concerned about the rating as I am with the
latent factors for the user/item.
Thanks!
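For what it's worth, a hedged sketch of one common workaround: factors for users/items that were in the training data come straight off the model, while an unseen user can be "folded in" by solving a small regularized least-squares problem against the item-factor matrix (the helper, its names, and the regularization value below are illustrative, not from this thread):

import numpy as np

# For entities that were in the training data
# (pyspark.mllib.recommendation.MatrixFactorizationModel):
# user_vec = model.userFeatures().lookup(user_id)[0]
# item_vec = model.productFeatures().lookup(item_id)[0]

def fold_in_user(item_factors, rated_item_ids, ratings, reg=0.1):
    """Estimate latent factors for a user absent from training, given their ratings."""
    Y = np.array([item_factors[i] for i in rated_item_ids])   # (n_ratings, rank)
    r = np.array(ratings, dtype=float)                        # (n_ratings,)
    rank = Y.shape[1]
    # Ridge-regression solve: (Y^T Y + reg*I) x = Y^T r
    return np.linalg.solve(Y.T @ Y + reg * np.eye(rank), Y.T @ r)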
ensor2, e_tensor[i,:,:]))
output = np.array(new_tensor).reshape(7,16,16)
On Tuesday, January 17, 2017 at 10:47:24 AM UTC-5, Corey Nolet wrote:
>
> I have a tensor which is (7,16,16), let's denote as Tijk. I need to make
> the following function:
>
> for j in range(0, 16):
>
I have a tensor which is (7,16,16), let's denote as Tijk. I need to make
the following function:
for j in range(0, 16):
    for k in range(0, 16):
        if T[0, j, k] == 0:
            for i in range(1, 7):
                T[i, j, k] = 0
Any ideas on how this can be done with Theano's tensor API?
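One hedged way to express that loop without Python-level iteration is to build a mask from the first slice and broadcast it over the remaining slices (a sketch only; variable names are illustrative):

import theano
import theano.tensor as T

X = T.tensor3('X')            # shape (7, 16, 16)
mask = T.neq(X[0], 0)         # (16, 16): 1 where T[0,j,k] != 0
# Zero out slices 1..6 wherever the first slice is zero, leaving slice 0 untouched.
Y = T.set_subtensor(X[1:], X[1:] * mask.dimshuffle('x', 0, 1))
f = theano.function([X], Y)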
I am currently implementing the fully convolutional regression network
which is outlined in detail in the paper "Synthetic Data for Text
Localisation in Natural Images" by Ankush Gupta et al.
I've got the network model compiled using the Keras API and I'm trying to
implement the custom loss
Welcome, guys!
On Thu, Sep 1, 2016 at 9:53 AM, Billie Rinaldi
wrote:
> Welcome, Mike and Marc!
>
> On Wed, Aug 31, 2016 at 7:58 AM, Josh Elser wrote:
>
> > Hiya folks,
> >
> > I wanted to take a moment to publicly announce some recent additions to
>
This may not be directly related, but I've noticed Hadoop packages have not
been uninstalling/updating well over the past year or so. The last couple times
I've run fedup, I've had to go back in manually and remove/update a bunch
of the Hadoop packages like Zookeeper and Parquet.
On Thu, Jun 2, 2016 at
Appears some projects are still being hit as of 11:45am.
On Fri, Apr 22, 2016 at 11:53 AM, Jordan Zimmerman <
jor...@jordanzimmerman.com> wrote:
> I just saw this on the Apache Jira:
>
> "Jira is in Temporary Lockdown mode as a spam countermeasure. Only
> logged-in users with active roles
Not sure if this is related but a bunch of projects in the Apache JIRA got
hit with a strange series of Spam messages in newly created JIRA tickets
yesterday. I know Infra adjusted some of the permissions for users as a
result.
On Fri, Apr 22, 2016 at 10:48 AM, Jordan Zimmerman <
Nevermind, I just noticed the message from infrastructure. Looks like it
affected a bunch of projects.
On Thu, Apr 21, 2016 at 10:07 PM, Corey Nolet <cjno...@gmail.com> wrote:
> Should someone request that this account is suspended until further
> notice? It looks like this person m
on't care what each individual
> event/tuple does, e.g. if you push different event types to separate kafka
> topics and all you care about is doing a count, what is the need for single event
> processing.
>
> On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet <cjno...@gmail.c
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
One thing I've noticed about Flink from following the project is that it has
established, in a few cases, some novel ideas and improvements over Spark. The
problem with it, however, is that both the development team and the community
around it are very small, and many of those novel
d yet in current Linux
> >> (work
> >>> going on currently to address this). So you're left with hugetlbfs,
> which
> >>> involves static allocations and much more pain.
> >>>
> >>> All the above is a long way to say: let's make sure we do
Gerald,
In order to unsubscribe from this list, you need to send an email to
user-unsubscr...@hadoop.apache.org.
On Wed, Mar 16, 2016 at 4:39 AM, Gerald-G wrote:
>
>
I was seeing Netty's unsafe classes being used here, not mapped byte
buffers. Not sure if that statement is completely correct, but I'll have to
dig through the code again to figure that out.
The more I was looking at unsafe, the more it makes sense why that would be
used. Apparently it's also supposed to be
Congrats!
On Thu, Mar 3, 2016 at 4:40 AM, Lorand Bendig wrote:
> Congratulations!
> --Lorand
> On Feb 24, 2016 22:30, "Rohini Palaniswamy"
> wrote:
>
> > It is my pleasure to announce that Xuefu Zhang is our newest addition to
> > the Pig PMC. Xuefu
Nevermind, a look @ the ExternalSorter class tells me that the iterator for
each key that's only partially ordered ends up being merge sorted by
equality after the fact. Wanted to post my finding on here for others who
may have the same questions.
On Tue, Mar 1, 2016 at 3:05 PM, Corey Nolet
.
How can this be assumed if the object used for the key (for instance, in
the case where a HashPartitioner is used) cannot be assumed to have an
ordering, and therefore a comparator cannot be assumed to be usable?
On Tue, Mar 1, 2016 at 2:56 PM, Corey Nolet <cjno...@gmail.com> wrote:
> So if I'
So if I'm using reduceByKey() with a HashPartitioner, I understand that the
hashCode() of my key is used to create the underlying shuffle files.
Is anything other than hashCode() used in the shuffle files when the data
is pulled into the reducers and run through the reduce function? The reason
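As a hedged PySpark illustration of the same hashCode()/equality point (the class below is made up): the partitioner only uses the key's hash to route records to shuffle files, but grouping values on the reduce side still relies on key equality, so the two must be consistent.

class CompositeKey(object):
    """Illustrative custom key; __hash__ routes to a partition, __eq__ groups values."""
    def __init__(self, user, day):
        self.user, self.day = user, day

    def __hash__(self):
        return hash((self.user, self.day))

    def __eq__(self, other):
        return (self.user, self.day) == (other.user, other.day)

# counts = rdd.map(lambda r: (CompositeKey(r[0], r[1]), 1)).reduceByKey(lambda a, b: a + b)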
spark dev people will say.
> Corey do you have presentation available online?
>
> On 8 February 2016 at 05:16, Corey Nolet <cjno...@gmail.com> wrote:
>
>> Charles,
>>
>> Thank you for chiming in and I'm glad someone else is experiencing this
>> too and n
Congrats guys!
On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu wrote:
> Congratulations, Herman and Wenchen.
>
> On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia
> wrote:
>
>> Hi all,
>>
>> The PMC has recently added two new Spark committers -- Herman van
of children and doesn't even run concurrently with any other stages
so I ruled out the concurrency of the stages as a culprit for the
shuffling problem we're seeing.
On Sun, Feb 7, 2016 at 7:49 AM, Corey Nolet <cjno...@gmail.com> wrote:
> Igor,
>
> I don't think the question is "wh
by key or something it should be
> ok, so some detail is missing...skewed data? aggregate by key?
>
> On 6 February 2016 at 20:13, Corey Nolet <cjno...@gmail.com> wrote:
>
>> Igor,
>>
>> Thank you for the response but unfortunately, the problem I'm referring
ey:
>>"The dataset is 100gb at most, the spills can up to 10T-100T", Are
>> your input files lzo format, and you use sc.text() ? If memory is not
>> enough, spark will spill 3-4x of input data to disk.
>>
>>
>> -- Original Message ---
The whole purpose of Apache mailing lists is that the messages get indexed
all over the web so that discussions and questions/solutions can be
searched easily by Google and other engines.
For this reason, and because the messages are sent via email as Steve pointed
out, it's just not possible to
rtitions
> play with shuffle memory fraction
>
> in spark 1.6 cache vs shuffle memory fractions are adjusted automatically
>
> On 5 February 2016 at 23:07, Corey Nolet <cjno...@gmail.com> wrote:
>
>> I just recently had a discovery that my jobs were taking several hours to
&
I just recently had a discovery that my jobs were taking several hours to
complete because of excess shuffle spills. What I found was that when I
hit the high point where I didn't have enough memory for the shuffles to
store all of their file consolidations at once, it could spill so many
times
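For context, a hedged sketch of the knobs being discussed in this thread for pre-1.6 Spark (the values are illustrative only, not recommendations from the thread):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.shuffle.memoryFraction", "0.4")   # pre-1.6 static shuffle fraction
        .set("spark.storage.memoryFraction", "0.4")   # pre-1.6 static cache fraction
        .set("spark.default.parallelism", "2000"))    # more, smaller shuffle partitions
sc = SparkContext(conf=conf)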
David,
Thank you very much for announcing this! It looks like it could be very
useful. Would you mind providing a link to the github?
On Tue, Jan 12, 2016 at 10:03 AM, David
wrote:
> Hi all,
>
> I'd like to share news of the recent release of a new Spark
/Federation.html
On Fri, Sep 25, 2015 at 12:42 AM, Ashish Kumar9 <ashis...@in.ibm.com> wrote:
> This is interesting . Can you share any blog/document that talks
> multi-volume HDFS instances .
>
> Thanks and Regards,
> Ashish Kumar
>
>
> From:Corey Nolet <cjn
If the hardware is drastically different, I would think a multi-volume HDFS
instance would be a good idea (put like-hardware in the same volumes).
On Mon, Sep 21, 2015 at 3:29 PM, Tushar Kapila wrote:
> Would only matter if OS specific communication was being used between
>
Mohamed,
Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA
was added to this recently and seasonal ARIMA should be following shortly.
[1] https://github.com/cloudera/spark-timeseries
On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar
wrote:
>
Sven,
What version of Accumulo are you running? We have a ticket for this [1]
which has had a lot of discussion on it. Christopher Tubbs mentioned that
he had gotten this to work.
[1] https://issues.apache.org/jira/browse/ACCUMULO-1378
On Wed, Sep 16, 2015 at 9:20 AM, Sven Hodapp
>>
>> Currently I'm using version 1.7
>>
>> Regards,
>> Sven
>>
>> - Original Message -
>> > From: "Corey Nolet" <cjno...@gmail.com>
>> > To: "user" <user@accumulo.apache.org>
>> > Sent: Wednesday
Unfortunately, MongoDB does not directly expose its locality via its client
API so the problem with trying to schedule Spark tasks against it is that
the tasks themselves cannot be scheduled locally on nodes containing query
results- which means you can only assume most results will be sent over
Usually more information as to the cause of this will be found down in your
logs. I generally see this happen when an out of memory exception has
occurred for one reason or another on an executor. It's possible your
memory settings are too small per executor or the concurrent number of
tasks you
1) Spark only needs to shuffle when data needs to be partitioned around the
workers in an all-to-all fashion.
2) Multi-stage jobs that would normally require several MapReduce jobs, thus
causing data to be dumped to disk between the jobs, can instead keep their
intermediate data cached in memory (see the sketch below).
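A hedged sketch of point 2 (the path and the parse_line helper are hypothetical):

# One pass over the raw data feeds two downstream jobs without re-reading the
# input or spilling the parsed intermediate to disk between them.
parsed = sc.textFile("hdfs:///data/events").map(parse_line).cache()   # parse_line is hypothetical

counts_by_user = parsed.map(lambda e: (e["user"], 1)).reduceByKey(lambda a, b: a + b)
counts_by_type = parsed.map(lambda e: (e["type"], 1)).reduceByKey(lambda a, b: a + b)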
I've been using SparkConf on my project for quite some time now to store
configuration information for its various components. This has worked very
well thus far in situations where I have control over the creation of the
SparkContext and the SparkConf.
I have run into a bit of a problem trying to
related logs can be found in the RM, NM, DN, and NN log files in detail.
Thanks again.
On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet cjno...@gmail.com wrote:
Elkhan,
What does the ResourceManager say about the final status of the job?
Spark jobs that run as Yarn applications can fail but still
Elkhan,
What does the ResourceManager say about the final status of the job? Spark
jobs that run as Yarn applications can fail but still successfully clean up
their resources and give them back to the Yarn cluster. Because of this,
there's a difference between your code throwing an exception in
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for
some reason, the inferSchema tools in Spark SQL extract the schema of
nested JSON objects as StructTypes.
This makes it really confusing when trying to rectify the object hierarchy
when I have maps because the Catalyst
doesn't have
differentiated data structures so we go with the one that gives you more
information when doing inference by default. If you pass in a schema to
JSON however, you can override this and have a JSON object parsed as a map.
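A hedged sketch of that override (the field names and nested-object shape are made up):

from pyspark.sql.types import MapType, StringType, StructField, StructType

schema = StructType([
    StructField("id", StringType()),
    # Force the nested object to come back as a map instead of an inferred struct.
    StructField("attributes", MapType(StringType(), StringType())),
])
df = sqlContext.read.json("events.json", schema=schema)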
On Fri, Jul 17, 2015 at 11:02 AM, Corey Nolet cjno
+1 on the happy hour!
On Mon, Jul 6, 2015 at 5:58 PM, Eric Newton eric.new...@gmail.com wrote:
More importantly, when are we going to have a happy hour to celebrate?
-Eric
On Mon, Jul 6, 2015 at 4:04 PM, Josh Elser josh.el...@gmail.com wrote:
Thanks to the efforts spearheaded by
of the data in the partition (fetching more than 1 record @ a time).
On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet cjno...@gmail.com wrote:
I don't know exactly what's going on under the hood but I would not assume
that just because a whole partition is not being pulled into memory @ one
time
I don't know exactly what's going on under the hood, but I would not assume
that just because a whole partition is not being pulled into memory at one
time, each record is being pulled one at a time. That's the
beauty of exposing Iterators/Iterables in an API rather than collections-
I've seen a few places where it's been mentioned that after a shuffle each
reducer needs to pull its partition into memory in its entirety. Is this
true? I'd assume the merge sort that needs to be done (in the cases where
sortByKey() is not used) wouldn't need to pull all of the data into memory
If you use rdd.mapPartitions(), you'll be able to get a hold of the
iterators for each partition. Then you should be able to do
iterator.grouped(size) on each of the partitions. I think it may mean you
have one group at the end of each partition that has fewer than size
elements. If that's okay
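The same idea in PySpark terms, as a hedged sketch (the batch size is illustrative):

from itertools import islice

def grouped(iterator, size):
    """Yield lists of up to `size` records from a partition's iterator."""
    iterator = iter(iterator)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# batched_rdd = rdd.mapPartitions(lambda part: grouped(part, 1000))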
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341
On Thu, Jun 18, 2015 at 7:55 PM, Du Li l...@yahoo-inc.com.invalid wrote:
repartition() means coalesce(shuffle=false)
On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com
wrote:
Doesn't
I'm confused about this. The comment on the function seems to indicate
that there is absolutely no shuffle or network IO but it also states that
it assigns an even number of parent partitions to each final partition
group. I'm having trouble seeing how this can be guaranteed without some
data
at 5:51 PM, Corey Nolet cjno...@gmail.com wrote:
An example of being able to do this is provided in the Spark Jetty Server
project [1]
[1] https://github.com/calrissian/spark-jetty-server
On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi all,
Is there any way
Doesn't repartition call coalesce(shuffle=true)?
On Jun 18, 2015 6:53 PM, Du Li l...@yahoo-inc.com.invalid wrote:
I got the same problem with rdd.repartition() in my streaming app, which
generated a few huge partitions and many tiny partitions. The resulting
high data skew makes the processing
So I've seen in the documentation that (after the overhead memory is
subtracted), the memory allocations of each executor are as follows (assume
default settings):
60% for cache
40% for tasks to process data
Reading about how Spark implements shuffling, I've also seen it say 20% of
executor
An example of being able to do this is provided in the Spark Jetty Server
project [1]
[1] https://github.com/calrissian/spark-jetty-server
On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi all,
Is there any way running Spark job in programmatic way on Yarn
I've become accustomed to being able to use system properties to override
properties in the Hadoop Configuration objects. I just recently noticed
that when Spark creates the Hadoop Configuration in the SparkContext, it
cycles through any properties prefixed with spark.hadoop. and adds those
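A hedged illustration of that behavior (the property name is just an example): any conf key prefixed with spark.hadoop. is copied into the Hadoop Configuration with the prefix stripped.

from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.hadoop.fs.s3a.connection.maximum", "100")
sc = SparkContext(conf=conf)

# The value shows up on the Hadoop Configuration (reachable through the JVM
# gateway in PySpark):
# sc._jsc.hadoopConfiguration().get("fs.s3a.connection.maximum")  # -> "100"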
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
://github.com/apache/spark/pull/5403
On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote:
Is it possible to configure Spark to do all of its shuffling FULLY in
memory (given that I have enough memory to store all the data)?
the OS buffer cache and
not ever touch spinning disk if it is a size that is less than memory
on the machine.
- Patrick
On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote:
So with this... to help my understanding of Spark under the hood-
Is this statement correct When
...@cloudera.com wrote:
Hi Corey,
As of this PR https://github.com/apache/spark/pull/5297/files, this can
be controlled with spark.yarn.submit.waitAppCompletion.
-Sandy
On Thu, May 28, 2015 at 11:48 AM, Corey Nolet cjno...@gmail.com wrote:
I am submitting jobs to my yarn cluster via the yarn
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm
noticing the jvm that fires up to allocate the resources, etc... is not
going away after the application master and executors have been allocated.
Instead, it just sits there printing 1 second status updates to the
console.
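A hedged example of the setting Sandy points to above; with it set to false, the submitting JVM should exit once the application is accepted instead of sitting there polling status:

from pyspark import SparkConf

conf = SparkConf().set("spark.yarn.submit.waitAppCompletion", "false")
# Or on the command line:
#   spark-submit --master yarn-cluster --conf spark.yarn.submit.waitAppCompletion=false ...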
[
https://issues.apache.org/jira/browse/ACCUMULO-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14560353#comment-14560353
]
Corey Nolet commented on ACCUMULO-1444:
---
My apologies for the late reply
Agreed.
Apache user lists archive questions and answers specifically for the
purpose of helping the larger community navigate its projects. It is not a
place for classifieds and employment information.
On Sun, May 17, 2015 at 9:24 PM, Billy Watson williamrwat...@gmail.com
wrote:
Uh, it's not
I'm firing up a KafkaServer (using some EmbeddedKafkaBroker code that I
found on Github) so that I can run an end-to-end test ingesting data
through a kafka topic with consumers in Spark Streaming pushing to
Accumulo.
Thus far, my code is doing this:
1) Creating a MiniAccumuloCluster and
as the leader, but it's strange that the log messages
above seem like they are missing the data. The "New topic creation callback for"
message seems like it should be listing a topic and not be blank.
Any ideas?
On Thu, May 14, 2015 at 1:00 PM, Corey Nolet cjno...@gmail.com wrote:
I'm firing up a KafkaServer (using some
of it that are making it
unparseable once pulled from zookeeper.
Any ideas as to what this could be? I'm using 0.8.2.0 - this is really what's
holding me back right now from getting my tests functional.
On Thu, May 14, 2015 at 4:29 PM, Corey Nolet cjno...@gmail.com wrote:
I raised the log levels to try to figure
The JSON-encoded blob definitely appears to be going in as a JSON string. The
partition assignment JSON seems to be the only thing that is being prefixed
by these bytes. Any ideas?
On Thu, May 14, 2015 at 5:17 PM, Corey Nolet cjno...@gmail.com wrote:
I think I figured out what the problem
I can get a 1.6.3 together.
On Tue, May 12, 2015 at 2:04 PM, Christopher ctubb...@apache.org wrote:
Sure, we can discuss that separately. I'll start a new thread.
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
On Tue, May 12, 2015 at 1:58 PM, Sean Busbey bus...@cloudera.com
That is, unless any of the new committers would like to take it on- in that
case, I can help ;-)
On Tue, May 12, 2015 at 3:41 PM, Corey Nolet cjno...@gmail.com wrote:
I can get a 1.6.3 together.
On Tue, May 12, 2015 at 2:04 PM, Christopher ctubb...@apache.org wrote:
Sure, we can discuss
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD call. After the input format runs, I want to do some directory
cleanup and I want to block while I'm doing that. Is that something I can
do inside of this function? If not, where would I accomplish this on every
It does look like the function that's executed runs in the driver, so doing an
Await.result() on a thread AFTER I've executed an action should work. Just
updating this here in case anyone has this question in the future.
Is this something I can do? I am using a FileOutputFormat inside of the
foreachRDD
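A hedged PySpark sketch of that pattern (the output path and the cleanup helper are hypothetical):

def save_and_cleanup(rdd):
    # The function passed to foreachRDD runs on the driver; only the save action
    # itself is distributed, so blocking on cleanup here is safe.
    rdd.saveAsTextFile("hdfs:///out/current-batch")
    cleanup_output_dir("hdfs:///out/current-batch")   # hypothetical blocking helper

dstream.foreachRDD(save_and_cleanup)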
Vaibnav,
The difference in an OR iterator is that you will want it to return a
single key for all of the given OR terms so that the iterator in the stack
above it would see it was a single hit. It's essentially a merge at the
key level to stop duplicate results from being returned (thus appearing
A tad off topic, but could still be relevant.
Accumulo's design is a tad different in the realm of being able to shard
and perform set intersections/unions server-side (through seeks). I've got
an adapter for Spark SQL on top of a document store implementation in
Accumulo that accepts the
Andrew,
Have you considered leveraging existing SQL query layers like Hive or
Spark's SQL/DataFrames API? There are some pretty massive optimizations
involved in that API making the push-down predicates / selections pretty
easy to adapt for Accumulo.
On Mon, Apr 27, 2015 at 8:37 PM, Andrew Wells
I'm always looking for places to help out and integrate/share design
ideas. I look forward to chatting with you about Q4A at the hackathon
tomorrow!
Have you, by chance, seen the Spark SQL adapter for the Accumulo Recipes
Event Entity Stores [1]? At the very least, it's a good example of using
Giovanni,
The DAG can be walked by calling the dependencies() function on any RDD.
It returns a Seq containing the parent RDDs. If you start at the leaves
and walk through the parents until dependencies() returns an empty Seq, you
ultimately have your DAG.
On Sat, Apr 25, 2015 at 1:28 PM, Akhil
I have a cluster of 3 nodes and I've created a topic with some number of
partitions and some number of replicas, let's say 10 and 2, respectively.
Later, after I've got my 3 nodes fairly consumed with data in the 10
partitions, I want to add 2 more nodes to the mix to help balance out the
If you return an iterable, you are not tying the API to a CompactBuffer.
Someday, the data could be fetched lazily and the API would not have to
change.
On Apr 23, 2015 6:59 PM, Dean Wampler deanwamp...@gmail.com wrote:
I wasn't involved in this decision (I just make the fries), but
tried this?
Within a window you would probably take the first x% as training and
the rest as test. I don't think there's a question of looking across
windows.
On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote:
Surprised I haven't gotten any responses about this. Has
How hard would it be to expose this in some way? I ask because the current
textFile and objectFile functions are obviously at some point calling out
to a FileInputFormat and configuring it.
Could we get a way to configure any arbitrary inputformat / outputformat?
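For what it's worth, a hedged sketch of the generic hooks that already exist for this on the PySpark side (the class names and conf key are examples only):

rdd = sc.newAPIHadoopFile(
    "hdfs:///data/input",
    inputFormatClass="org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.Text",
    conf={"mapreduce.input.fileinputformat.split.maxsize": "134217728"},
)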
for ARIMA models?
On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote:
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in spark
streaming. Perhaps it's my ignorance with the micro-batched
Taking out the complexity of the ARIMA models to simplify things- I can't
seem to find a good way to represent even standard moving averages in spark
streaming. Perhaps it's my ignorance with the micro-batched style of the
DStreams API.
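For what it's worth, a hedged sketch of one way to express a plain moving average over a DStream window (the window and slide durations are illustrative):

def moving_average(dstream, window_secs, slide_secs):
    # Pair each value with a count, sum both over the window, then divide.
    sums = (dstream.map(lambda v: (float(v), 1))
                   .reduceByWindow(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                                   None, window_secs, slide_secs))
    return sums.map(lambda pair: pair[0] / pair[1])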
On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno
I want to use ARIMA for a predictive model so that I can take time series
data (metrics) and perform a light anomaly detection. The time series data
is going to be bucketed to different time units (several minutes within
several hours, several hours within several days, several days within
several
Spark uses a SerializableWritable [1] to Java-serialize Writable objects.
I've noticed (at least in Spark 1.2.1) that it breaks down with some
objects when Kryo is used instead of regular Java serialization. Though it
is wrapping the actual AccumuloInputFormat (another example of something
you
I would do sum of squares. This would allow you to keep an ongoing value as an
associative operation (in an aggregator) and then calculate the variance /
standard deviation after the fact.
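A hedged sketch of that approach, keeping (count, sum, sum of squares) as the running state and deriving the statistics afterwards:

import math

def seq_op(acc, x):
    n, s, sq = acc
    return (n + 1, s + x, sq + x * x)

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def finalize(acc):
    n, s, sq = acc
    mean = s / n
    var = sq / n - mean * mean   # population variance
    return mean, var, math.sqrt(var)

# e.g. over an RDD of numbers:
# mean, var, std = finalize(rdd.aggregate((0, 0.0, 0.0), seq_op, comb_op))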
On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang hw...@qilinsoft.com wrote:
Hi,
I have a DataFrame object and I
Given the following scenario:
dstream.map(...).filter(...).window(...).foreachrdd()
When would the onBatchCompleted fire?
+1
On Tue, Mar 10, 2015 at 10:57 AM, David Medinets david.medin...@gmail.com
wrote:
+1
On Tue, Mar 10, 2015 at 10:56 AM, Adam Fuchs afu...@apache.org wrote:
+1
Adam
On Mar 10, 2015 2:48 AM, Sean Busbey bus...@cloudera.com wrote:
Hi Accumulo!
This is the VOTE thread following
. The batching in the new producer is
per topic partition; the batch size is controlled by both the max batch
size and the linger time config.
Jiangjie (Becket) Qin
On 3/9/15, 10:10 AM, Corey Nolet cjno...@gmail.com wrote:
I'm curious what type of batching Kafka producers do at the socket layer
I'm new to Kafka and I'm trying to understand the version semantics. We
want to use Kafka w/ Spark but our version of Spark is tied to 0.8.0. We
were wondering what guarantees are made about backwards compatibility across
0.8.x.x. At first glance, given the 3 digits used for versions, I figured
I'm curious what type of batching Kafka producers do at the socket layer.
For instance, if I have a partitioner that round-robins n messages to a
different partition, am I guaranteed to get n different messages sent over
the socket or is there some micro-batching going on underneath?
I am trying
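For reference, a hedged illustration of the two producer settings named above, shown with the kafka-python client rather than the Java producer (the values are illustrative):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    batch_size=16384,   # max bytes buffered per topic-partition before a send
    linger_ms=5,        # how long to wait for more records before flushing the batch
)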
+1 (non-binding)
- Verified signatures
- Built on Mac OS X and Fedora 21.
On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar ksanka...@gmail.com wrote:
Excellent, Thanks Xiangrui. The mystery is solved.
Cheers
k/
On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng men...@gmail.com wrote: