Re: Pyarrow Plasma client.release() fault

2018-07-20 Thread Corey Nolet
t > > process or the store). > > > > Can you check if the object store evicting objects (it prints something > to > > stdout/stderr when this happens)? Could you be running out of memory but > > failing to release the objects? > > > > On Tue, Jul 10, 201

Re: Pyarrow Plasma client.release() fault

2018-07-10 Thread Corey Nolet
art the Jupyter kernel. On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet wrote: > Wes, > > Unfortunately, my code is on a separate network. I'll try to explain what > I'm doing and if you need further detail, I can certainly pseudocode > specifics. > > I am using multiproce

Re: Pyarrow Plasma client.release() fault

2018-07-10 Thread Corey Nolet
gure out where and what I'm doing. Thanks again! On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney wrote: > hi Corey, > > Can you provide the code (or a simplified version thereof) that shows > how you're using Plasma? > > - Wes > > On Tue, Jul 10, 2018 at 11:45 AM,

Pyarrow Plasma client.release() fault

2018-07-10 Thread Corey Nolet
I'm on a system with 12TB of memory and attempting to use Pyarrow's Plasma client to convert a series of CSV files (via Pandas) into a Parquet store. I've got a little over 20k CSV files to process which are about 1-2gb each. I'm loading 500 to 1000 files at a time. In each iteration, I'm loading
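A minimal sketch of the kind of loop described above, assuming a plasma store is already running at /tmp/plasma (the socket path, file paths, and batching are placeholders, and Plasma has since been removed from recent pyarrow releases):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")   # store started out-of-band

    def csv_batch_to_parquet(csv_paths, out_path):
        # Stage each CSV in the object store, then combine and write Parquet.
        object_ids = [client.put(pd.read_csv(p)) for p in csv_paths]
        frames = client.get(object_ids)
        table = pa.Table.from_pandas(pd.concat(frames, ignore_index=True))
        pq.write_table(table, out_path)
        # Drop our references and release so the store can evict these objects;
        # getting this release step right is exactly what the thread is about.
        del frames
        for oid in object_ids:
            client.release(oid)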

Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Corey Nolet
ark-on-a-single-node-machine.html > > regars, > > 2018-05-23 22:30 GMT+02:00 Corey Nolet : > >> Please forgive me if this question has been asked already. >> >> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if >> anyone know

PySpark API on top of Apache Arrow

2018-05-23 Thread Corey Nolet
Please forgive me if this question has been asked already. I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1TB of r

Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Corey Nolet
lient object successfully (it has > a socket connection to the store). > > On Wed, May 16, 2018 at 3:43 PM Corey Nolet wrote: > >> Robert, >> >> Thank you for the quick response. I've been playing around for a few hours >> to get a feel for how this

Re: PyArrow & Python Multiprocessing

2018-05-16 Thread Corey Nolet
think of it as a replacement for > Python multiprocessing that automatically uses shared memory and Arrow for > serialization. > > On Wed, May 16, 2018 at 10:02 AM Corey Nolet wrote: > > > I've been reading through the PyArrow documentation and trying to > > understand

PyArrow & Python Multiprocessing

2018-05-16 Thread Corey Nolet
I've been reading through the PyArrow documentation and trying to understand how to use the tool effectively for IPC (using zero-copy). I'm on a system with 586 cores & 1TB of RAM. I'm using Pandas DataFrames to process several tens of gigs of data in memory and the pickling that is done by Pytho
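A rough sketch of the pattern being asked about, assuming a plasma store already running at /tmp/plasma: only the 20-byte object IDs cross the process boundary, so the workers avoid pickling the DataFrame itself (paths and sizes are made up).

    import multiprocessing as mp
    import pandas as pd
    import pyarrow.plasma as plasma

    PLASMA_SOCKET = "/tmp/plasma"   # assumption: store started out-of-band

    def worker(oid_bytes):
        # Each worker opens its own client and reads the frame out of shared memory.
        client = plasma.connect(PLASMA_SOCKET)
        df = client.get(plasma.ObjectID(oid_bytes))
        return df["value"].sum()

    if __name__ == "__main__":
        client = plasma.connect(PLASMA_SOCKET)
        oid = client.put(pd.DataFrame({"value": range(1000000)}))
        with mp.Pool(4) as pool:
            print(pool.map(worker, [oid.binary()] * 4))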

Re: Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
tions until the other user gets worked into the model. On Mon, Nov 27, 2017 at 3:08 PM, Corey Nolet wrote: > I'm trying to use the MatrixFactorizationModel to, for instance, determine > the latent factors of a user or item that were not used in the training > data of the model. I

Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
I'm trying to use the MatrixFactorizationModel to, for instance, determine the latent factors of a user or item that were not used in the training data of the model. I'm not as concerned about the rating as I am with the latent factors for the user/item. Thanks!
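For reference, the latent factors are exposed directly on the model in MLlib; a small sketch (toy ratings, rank chosen arbitrarily) showing where they live and why an unseen user or item has no vector to look up:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="latent-factors-sketch")
    ratings = sc.parallelize([Rating(0, 10, 4.0), Rating(0, 11, 1.0), Rating(1, 10, 5.0)])
    model = ALS.train(ratings, rank=8, iterations=5)

    # The factors live on the model as (id, factor-array) pair RDDs.
    user_factors = dict(model.userFeatures().collect())
    item_factors = dict(model.productFeatures().collect())
    print(user_factors[0])   # 8-dimensional latent vector for user 0

    # A user or item that never appeared in the training data simply has no
    # entry here, which is why a fold-in (or retraining) step is needed.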

[theano-users] Re: Apply generic element wise function against tensor

2017-01-17 Thread Corey Nolet
ply(tensor2, e_tensor[i,:,:])) output = np.array(new_tensor).reshape(7,16,16) On Tuesday, January 17, 2017 at 10:47:24 AM UTC-5, Corey Nolet wrote: > > I have a tensor which is (7,16,16), let's denote as Tijk. I need to make > the following function: > > for j in range(0,

[theano-users] Apply generic element wise function against tensor

2017-01-17 Thread Corey Nolet
I have a tensor which is (7,16,16), let's denote as Tijk. I need to make the following function: for j in range(0, 16): for k in range(0, 16): if T0jk == 0: for i in range(1, 7): Tijk = 0 Any ideas on how this can be done with Theano's tensor API? Thank
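One way to express that loop without Python-level iteration is to broadcast a mask built from the first slice; a sketch under that interpretation (variable names are mine):

    import numpy as np
    import theano
    import theano.tensor as T

    X = T.tensor3("X")                        # shape (7, 16, 16)

    # Zero out T[i,j,k] wherever T[0,j,k] == 0. Broadcasting the mask over the
    # first axis also touches i = 0, which is harmless since T[0,j,k] is already
    # 0 in exactly those positions.
    mask = T.neq(X[0], 0)                     # (16, 16)
    result = X * mask.dimshuffle("x", 0, 1)   # broadcast to (7, 16, 16)

    f = theano.function([X], result)
    out = f(np.random.randint(0, 2, size=(7, 16, 16)).astype(theano.config.floatX))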

[theano-users] Complex if/else operations in a custom loss function on 3-dimensional tensors

2017-01-13 Thread Corey Nolet
I am currently implementing the fully convolutional regression network which is outlined in detail in the paper "Synthetic Data for Text Localisation in Natural Images" by Ankush Gupta et al. I've got the network model compiled using the Keras API and I'm trying to implement the custom loss la

Re: New Committers/PMC members!

2016-09-01 Thread Corey Nolet
Welcome, guys! On Thu, Sep 1, 2016 at 9:53 AM, Billie Rinaldi wrote: > Welcome, Mike and Marc! > > On Wed, Aug 31, 2016 at 7:58 AM, Josh Elser wrote: > > > Hiya folks, > > > > I wanted to take a moment to publicly announce some recent additions to > > the Apache Accumulo family (committers and

Re: Hadoop

2016-06-02 Thread Corey Nolet
This may not be directly related, but I've noticed Hadoop packages have not been uninstalling/updating well over the past year or so. The last couple of times I've run fedup, I've had to go back in manually and remove/update a bunch of the Hadoop packages like Zookeeper and Parquet. On Thu, Jun 2, 2016 at

RE: Account hacked?

2016-04-22 Thread Corey Nolet
On the Accumulo project, we were getting spam up until 2 hours ago. We were still using Hadoop's permissions scheme. I think the lockdown only works if you are using the default permissions scheme. Once we flipped to default, the spam stopped. On Apr 22, 2016 3:11 PM, "Zheng, Kai" wrote: > > Fr

Re: adding ACL enforcement based on ACLProvider, for consistency

2016-04-22 Thread Corey Nolet
Appears some projects are still being hit as of 11:45am. On Fri, Apr 22, 2016 at 11:53 AM, Jordan Zimmerman < jor...@jordanzimmerman.com> wrote: > I just saw this on the Apache Jira: > > "Jira is in Temporary Lockdown mode as a spam countermeasure. Only > logged-in users with active roles (commit

Re: adding ACL enforcement based on ACLProvider, for consistency

2016-04-22 Thread Corey Nolet
Not sure if this is related but a bunch of projects in the Apache JIRA got hit with a strange series of Spam messages in newly created JIRA tickets yesterday. I know Infra adjusted some of the permissions for users as a result. On Fri, Apr 22, 2016 at 10:48 AM, Jordan Zimmerman < jor...@jordanzimm

Re: [jira] [Deleted] (ARROW-117) AVG SUPPORT 18004923958 avg Antivirus tech support number avg Antivirus Help Desk Number customer care

2016-04-21 Thread Corey Nolet
Nevermind, I just noticed the message from infrastructure. Looks like it affected a bunch of projects. On Thu, Apr 21, 2016 at 10:07 PM, Corey Nolet wrote: > Should someone request that this account is suspended until further > notice? It looks like this person may have been compromised.

Re: [jira] [Deleted] (ARROW-117) AVG SUPPORT 18004923958 avg Antivirus tech support number avg Antivirus Help Desk Number customer care

2016-04-21 Thread Corey Nolet
Should someone request that this account is suspended until further notice? It looks like this person may have been compromised. On Thu, Apr 21, 2016 at 7:08 PM, Daniel Takamori (JIRA) wrote: > > [ > https://issues.apache.org/jira/browse/ARROW-117?page=com.atlassian.jira.plugin.system.issue

Re: Apache Flink

2016-04-17 Thread Corey Nolet
you don't care what each individual > event/tuple does, e.g. of you push different event types to separate kafka > topics and all you care is to do a count, what is the need for single event > processing. > > On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet wrote: > >> i ha

Re: Apache Flink

2016-04-17 Thread Corey Nolet
> Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > > > > On

Re: Apache Flink

2016-04-17 Thread Corey Nolet
One thing I've noticed about Flink in my following of the project has been that it has established, in a few cases, some novel ideas and improvements over Spark. The problem with it, however, is that both the development team and the community around it are very small and many of those novel improv

Re: Understanding "shared" memory implications

2016-03-19 Thread Corey Nolet
t in current Linux > >> (work > >>> going on currently to address this). So you're left with hugetlbfs, > which > >>> involves static allocations and much more pain. > >>> > >>> All the above is a long way to say: let's make sure

Re: unsubscribe

2016-03-19 Thread Corey Nolet
Gerald, In order to unsubscribe from this list, you need to send an email to user-unsubscr...@hadoop.apache.org. On Wed, Mar 16, 2016 at 4:39 AM, Gerald-G wrote: > >

Re: Understanding "shared" memory implications

2016-03-15 Thread Corey Nolet
I was seeing Netty's unsafe classes being used here, not mapped byte buffers. Not sure if that statement is completely correct, but I'll have to dig through the code again to figure that out. The more I was looking at unsafe, the more it makes sense why that would be used. Apparently it's also supposed to be

Re: I setup a slack team to have a live channel to discuss Arrow

2016-03-11 Thread Corey Nolet
Myself as well, thanks! On Fri, Mar 11, 2016 at 3:26 PM, pino patera wrote: > Would like an invite! Thx > On Fri, 11 Mar 2016 at 21:19, Dale Jung wrote: > > > Would love an invite as well. Thanks! > > > > On Fri, Mar 11, 2016 at 3:15 PM, Prateek Rungta > > wrote: > > > > > Could I get an invi

Re: Welcome to our new Pig PMC member Xuefu Zhang

2016-03-04 Thread Corey Nolet
Congrats! On Thu, Mar 3, 2016 at 4:40 AM, Lorand Bendig wrote: > Congratulations! > --Lorand > On Feb 24, 2016 22:30, "Rohini Palaniswamy" > wrote: > > > It is my pleasure to announce that Xuefu Zhang is our newest addition to > > the Pig PMC. Xuefu is a long time committer of Pig and has been

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
Nevermind, a look @ the ExternalSorter class tells me that the iterator for each key that's only partially ordered ends up being merge sorted by equality after the fact. Wanted to post my finding on here for others who may have the same questions. On Tue, Mar 1, 2016 at 3:05 PM, Corey

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
How can this be assumed if the object used for the key, for instance, in the case where a HashPartitioner is used, cannot assume ordering and therefore cannot assume a comparator can be used? On Tue, Mar 1, 2016 at 2:56 PM, Corey Nolet wrote: > So if I'm using reduceByKey() with a HashPa

Shuffle guarantees

2016-03-01 Thread Corey Nolet
So if I'm using reduceByKey() with a HashPartitioner, I understand that the hashCode() of my key is used to create the underlying shuffle files. Is anything other than hashCode() used in the shuffle files when the data is pulled into the reducers and run through the reduce function? The reason I'm
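As context for the question, hash partitioning only decides which shuffle partition (and hence file) a key lands in; keys are then grouped by equality inside each partition. A toy illustration of that mapping (not Spark's actual code):

    # Toy illustration of hash partitioning: only hash(key) % numPartitions
    # picks the shuffle partition; no comparator/ordering is required, and keys
    # are matched by equality when the reducer consumes the partition.
    def hash_partition(key, num_partitions):
        return hash(key) % num_partitions

    for key in ["alpha", "beta", "gamma", "alpha"]:
        print(key, "->", hash_partition(key, 8))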

Re: unsubscribe please

2016-02-24 Thread Corey Nolet
Russell, Please send a message to dev-unsubscr...@arrow.apache.org. On Wed, Feb 24, 2016 at 2:56 PM, Russell Simmons < russell.emergen...@gmail.com> wrote: > thx >

Re: Question about mutability

2016-02-24 Thread Corey Nolet
t other options do you have in mind? > > On Wed, Feb 24, 2016 at 11:30 AM Corey Nolet wrote: > > > Agreed, > > > > I thought the whole purpose was to share the memory space (using possibly > > unsafe operations like ByteBuffers) so that it could be directly shared

Re: Question about mutability

2016-02-24 Thread Corey Nolet
. > > > > IIUC, with Arrow, when application A shares data with application B, the > > data is still duplicated in the memory spaces of A and B. It's just that > > data serialization/deserialization are much faster with Arrow (compared > > with Protobuf). > > > >

Re: Question about mutability

2016-02-24 Thread Corey Nolet
with Arrow, when application A shares data with application B, the > data is still duplicated in the memory spaces of A and B. It's just that > data serialization/deserialization are much faster with Arrow (compared > with Protobuf). > > On Wed, Feb 24, 2016 at 10:40 AM Corey

Question about mutability

2016-02-24 Thread Corey Nolet
Forgive me if this question seems ill-informed. I just started looking at Arrow yesterday. I looked around the github a tad. Are you expecting the memory space held by one application to be mutable by that application and made available to all applications trying to read the memory space?

Re: Welcoming two new committers

2016-02-08 Thread Corey Nolet
Congrats guys! On Mon, Feb 8, 2016 at 12:23 PM, Ted Yu wrote: > Congratulations, Herman and Wenchen. > > On Mon, Feb 8, 2016 at 9:15 AM, Matei Zaharia > wrote: > >> Hi all, >> >> The PMC has recently added two new Spark committers -- Herman van Hovell >> and Wenchen Fan. Both have been heavily

Re: Shuffle memory woes

2016-02-08 Thread Corey Nolet
ople will say. > Corey do you have presentation available online? > > On 8 February 2016 at 05:16, Corey Nolet wrote: > >> Charles, >> >> Thank you for chiming in and I'm glad someone else is experiencing this >> too and not just me. I know very well how the Spark

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
: >>"The dataset is 100gb at most, the spills can up to 10T-100T", Are >> your input files lzo format, and you use sc.text() ? If memory is not >> enough, spark will spill 3-4x of input data to disk. >> >> >> -- 原始邮件

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
ot of children and doesn't even run concurrently with any other stages so I ruled out the concurrency of the stages as a culprit for the shuffling problem we're seeing. On Sun, Feb 7, 2016 at 7:49 AM, Corey Nolet wrote: > Igor, > > I don't think the question is "why can

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
n > if map side is ok, and you just reducing by key or something it should be > ok, so some detail is missing...skewed data? aggregate by key? > > On 6 February 2016 at 20:13, Corey Nolet wrote: > >> Igor, >> >> Thank you for the response but unfortunately, the pro

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Corey Nolet
The whole purpose of Apache mailing lists is that the messages get indexed all over the web so that discussions and questions/solutions can be searched easily by google and other engines. For this reason, and the messages being sent via email as Steve pointed out, it's just not possible to retract

Re: Shuffle memory woes

2016-02-06 Thread Corey Nolet
o have more partitions > play with shuffle memory fraction > > in spark 1.6 cache vs shuffle memory fractions are adjusted automatically > > On 5 February 2016 at 23:07, Corey Nolet wrote: > >> I just recently had a discovery that my jobs were taking several hours to >> compl

Shuffle memory woes

2016-02-05 Thread Corey Nolet
I just recently discovered that my jobs were taking several hours to complete because of excess shuffle spills. What I found was that when I hit the high point where I didn't have enough memory for the shuffles to store all of their file consolidations at once, it could spill so many times t

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David, Thank you very much for announcing this! It looks like it could be very useful. Would you mind providing a link to the github? On Tue, Jan 12, 2016 at 10:03 AM, David wrote: > Hi all, > > I'd like to share news of the recent release of a new Spark package, ROSE. > > > ROSE is a Scala lib

Re: Hetergeneous Hadoop Cluster

2015-09-25 Thread Corey Nolet
adoop-hdfs/Federation.html On Fri, Sep 25, 2015 at 12:42 AM, Ashish Kumar9 wrote: > This is interesting . Can you share any blog/document that talks > multi-volume HDFS instances . > > Thanks and Regards, > Ashish Kumar > > > From:Corey Nolet > To:user@hadoop.apach

Re: Hetergeneous Hadoop Cluster

2015-09-24 Thread Corey Nolet
If the hardware is drastically different, I would think a multi-volume HDFS instance would be a good idea (put like-hardware in the same volumes). On Mon, Sep 21, 2015 at 3:29 PM, Tushar Kapila wrote: > Would only matter if OS specific communication was being used between > nodes. I assume they

Re: Forecasting Library For Apache Spark

2015-09-21 Thread Corey Nolet
Mohamed, Have you checked out the Spark Timeseries [1] project? Non-seasonal ARIMA was added to this recently and seasonal ARIMA should be following shortly. [1] https://github.com/cloudera/spark-timeseries On Mon, Sep 21, 2015 at 7:47 AM, Mohamed Baddar wrote: > Hello everybody , this my firs

Re: Mini Accumulo Cluster reusing the directory

2015-09-16 Thread Corey Nolet
10:40 Sven Hodapp > wrote: > >> Hi Corey, >> >> thanks for your reply and the link. Sounds good, if that will be >> available in the future! >> Is the code from Christopher somewhere deployed? >> >> Currently I'm using version 1.7 >> >

Re: Mini Accumulo Cluster reusing the directory

2015-09-16 Thread Corey Nolet
Sven, What version of Accumulo are you running? We have a ticket for this [1] which has had a lot of discussion on it. Christopher Tubbs mentioned that he had gotten this to work. [1] https://issues.apache.org/jira/browse/ACCUMULO-1378 On Wed, Sep 16, 2015 at 9:20 AM, Sven Hodapp wrote: > Hi t

Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client API so the problem with trying to schedule Spark tasks against it is that the tasks themselves cannot be scheduled locally on nodes containing query results- which means you can only assume most results will be sent over th

Re: What is the reason for ExecutorLostFailure?

2015-08-18 Thread Corey Nolet
Usually more information as to the cause of this will be found down in your logs. I generally see this happen when an out of memory exception has occurred for one reason or another on an executor. It's possible your memory settings are too small per executor or the concurrent number of tasks you ar

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Corey Nolet
1) Spark only needs to shuffle when data needs to be partitioned across the workers in an all-to-all fashion. 2) Multi-stage jobs that would normally require several MapReduce jobs (with data dumped to disk between the jobs) can instead keep intermediate data cached in memory.
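A small PySpark illustration of point 2 (the path and filter are placeholders): the parsed RDD is cached once and reused by two actions instead of being recomputed or written to disk between jobs.

    from pyspark import SparkContext

    sc = SparkContext(appName="cache-sketch")

    lines = sc.textFile("hdfs:///data/events.txt")               # hypothetical input
    parsed = lines.map(lambda line: line.split(",")).cache()     # kept in memory after first use

    # Both actions reuse the cached data rather than re-reading and re-parsing it,
    # which is one reason multi-stage pipelines beat chained MapReduce jobs.
    total = parsed.count()
    errors = parsed.filter(lambda fields: fields[0] == "ERROR").count()
    print(total, errors)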

SparkConf "ignoring" keys

2015-08-05 Thread Corey Nolet
I've been using SparkConf on my project for quite some time now to store configuration information for its various components. This has worked very well thus far in situations where I have control over the creation of the SparkContext & the SparkConf. I have run into a bit of a problem trying to i

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Corey Nolet
I can only give you a general overview of how the Yarn integration works from the Scala point of view. Hope this helps. > Yarn related logs can be found in RM ,NM, DN, NN log files in detail. > > Thanks again. > > On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet wrote: > >>

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-27 Thread Corey Nolet
Elkhan, What does the ResourceManager say about the final status of the job? Spark jobs that run as Yarn applications can fail but still successfully clean up their resources and give them back to the Yarn cluster. Because of this, there's a difference between your code throwing an exception in a

Re: MapType vs StructType

2015-07-17 Thread Corey Nolet
we don't support union types). JSON doesn't have >> differentiated data structures so we go with the one that gives you more >> information when doing inference by default. If you pass in a schema to >> JSON however, you can override this and have a JSON object parsed as a map. &g

MapType vs StructType

2015-07-17 Thread Corey Nolet
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for some reason, the "inferSchema" tools in Spark SQL extracts the schema of nested JSON objects as StructTypes. This makes it really confusing when trying to rectify the object hierarchy when I have maps because the Catalyst c
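As the reply above suggests, the inferred StructType can be overridden by supplying an explicit schema that uses MapType; a sketch against the Spark 1.x API (field names are placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, StringType, MapType

    sc = SparkContext(appName="json-maptype-sketch")
    sqlContext = SQLContext(sc)

    # Force the nested "attributes" object to be parsed as a map instead of a struct.
    schema = StructType([
        StructField("id", StringType()),
        StructField("attributes", MapType(StringType(), StringType())),
    ])

    df = sqlContext.read.json("hdfs:///data/records.json", schema=schema)
    df.printSchema()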

Re: Post 1.5.3 and 1.6.3

2015-07-06 Thread Corey Nolet
+1 on the happy hour! On Mon, Jul 6, 2015 at 5:58 PM, Eric Newton wrote: > More importantly, when are we going to have a happy hour to celebrate? > > -Eric > > > On Mon, Jul 6, 2015 at 4:04 PM, Josh Elser wrote: > > > Thanks to the efforts spearheaded by Christopher and verified by everyone > >

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
e chunking of the data in the partition (fetching more than 1 record @ a time). On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet wrote: > I don't know exactly what's going on under the hood but I would not assume > that just because a whole partition is not being pulled into memory @

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time that that means each record is being pulled at 1 time. That's the beauty of exposing Iterators & Iterables in an API rather than collections- the

Reducer memory usage

2015-06-21 Thread Corey Nolet
I've seen a few places where it's been mentioned that after a shuffle each reducer needs to pull its partition into memory in its entirety. Is this true? I'd assume the merge sort that needs to be done (in the cases where sortByKey() is not used) wouldn't need to pull all of the data into memory at

Re: Grouping elements in a RDD

2015-06-20 Thread Corey Nolet
If you use rdd.mapPartitions(), you'll be able to get a hold of the iterator for each partition. Then you should be able to do iterator.grouped(size) on each of the partitions. I think it may mean you have one group at the end of each partition that has fewer than "size" elements. If that's oka
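The iterator.grouped(size) call above is Scala; a PySpark-flavored equivalent can be sketched with itertools (group size is arbitrary):

    from itertools import islice
    from pyspark import SparkContext

    sc = SparkContext(appName="grouped-partitions-sketch")

    def grouped(iterator, size):
        # Yield fixed-size chunks from a partition iterator; the last chunk in
        # each partition may hold fewer than `size` elements.
        while True:
            chunk = list(islice(iterator, size))
            if not chunk:
                return
            yield chunk

    rdd = sc.parallelize(range(100), 4)
    print(rdd.mapPartitions(lambda it: grouped(it, 8)).take(3))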

Re: Welcoming some new committers

2015-06-20 Thread Corey Nolet
Congrats guys! Keep up the awesome work! On Sat, Jun 20, 2015 at 3:28 PM, Guru Medasani wrote: > Congratulations to all the new committers! > > Guru Medasani > gdm...@gmail.com > > > > > On Jun 17, 2015, at 5:12 PM, Matei Zaharia > wrote: > > > > Hey all, > > > > Over the past 1.5 months we add

Coalescing with shuffle = false in imbalanced cluster

2015-06-18 Thread Corey Nolet
I'm confused about this. The comment on the function seems to indicate that there is absolutely no shuffle or network IO but it also states that it assigns an even number of parent partitions to each final partition group. I'm having trouble seeing how this can be guaranteed without some data pass

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341 On Thu, Jun 18, 2015 at 7:55 PM, Du Li wrote: > repartition() means coalesce(shuffle=false) > > > > On Thursday, June 18, 2015 4:07 PM, Corey Nolet > wrote: > > > Doesn't repart

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, "Du Li" wrote: > I got the same problem with rdd,repartition() in my streaming app, which > generated a few huge partitions and many tiny partitions. The resulting > high data skew makes the processing time of a batch unpre
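For reference, repartition() is implemented as coalesce(shuffle = true); a PySpark sketch of the two calls side by side (partition counts are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext(appName="repartition-vs-coalesce-sketch")
    rdd = sc.parallelize(range(1000), 100)

    # repartition(n) always shuffles (coalesce with shuffle=True under the hood),
    # so it can grow or shrink the partition count and rebalances data evenly.
    evenly_spread = rdd.repartition(10)

    # coalesce(n, shuffle=False) only merges existing partitions locally, which
    # avoids a shuffle but can leave one huge partition and many tiny ones.
    merged_locally = rdd.coalesce(10, shuffle=False)

    print(evenly_spread.getNumPartitions(), merged_locally.getNumPartitions())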

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-18 Thread Corey Nolet
aster("local") > >.setConf(SparkLauncher.DRIVER_MEMORY, "2g") > > .launch(); > > spark.waitFor(); > >} > > } > > } > > > > On Wed, Jun 17, 2015 at 5:51 PM, Corey Nolet wrote: > >> An example of being able to do thi

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Corey Nolet
An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov wrote: > Hi all, > > Is there any way running Spark job in programmatic way on Yarn cluster > without using

Executor memory allocations

2015-06-17 Thread Corey Nolet
So I've seen in the documentation that (after the overhead memory is subtracted), the memory allocations of each executor are as follows (assume default settings): 60% for cache, 40% for tasks to process data. Reading about how Spark implements shuffling, I've also seen it say "20% of executor mem
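For reference, those percentages correspond to the legacy (pre-1.6) memory knobs; a sketch of adjusting them, with values chosen arbitrarily:

    from pyspark import SparkConf, SparkContext

    # Legacy Spark 1.x knobs behind the 60% / 20% figures quoted above; Spark 1.6+
    # replaced them with unified memory management (spark.memory.*).
    conf = (SparkConf()
            .setAppName("memory-fractions-sketch")
            .set("spark.storage.memoryFraction", "0.5")    # cache space (default 0.6)
            .set("spark.shuffle.memoryFraction", "0.3"))   # shuffle aggregation space (default 0.2)

    sc = SparkContext(conf=conf)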

Using spark.hadoop.* to set Hadoop properties

2015-06-17 Thread Corey Nolet
I've become accustomed to being able to use system properties to override properties in the Hadoop Configuration objects. I just recently noticed that when Spark creates the Hadoop Configuration in the SparkContext, it cycles through any properties prefixed with spark.hadoop. and adds those properti
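A sketch of that mechanism from the configuration side (the property names are just examples): anything set as spark.hadoop.<key> on the SparkConf is stripped of the prefix and lands on the Hadoop Configuration Spark builds.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("hadoop-prefix-sketch")
            .set("spark.hadoop.fs.s3a.connection.maximum", "100")
            .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "134217728"))

    sc = SparkContext(conf=conf)
    # Peek at the JVM-side Hadoop Configuration (internal accessor, useful as a debugging aid).
    print(sc._jsc.hadoopConfiguration().get("fs.s3a.connection.maximum"))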

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
fer cache and > not ever touch spinning disk if it is a size that is less than memory > on the machine. > > - Patrick > > On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote: > > So with this... to help my understanding of Spark under the hood- > > > > Is this sta

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
b.com/apache/spark/pull/5403 > > > > On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet wrote: > >> Is it possible to configure Spark to do all of its shuffling FULLY in >> memory (given that I have enough memory to store all the data)? >> >> >> >> >

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
yza wrote: > Hi Corey, > > As of this PR https://github.com/apache/spark/pull/5297/files, this can > be controlled with spark.yarn.submit.waitAppCompletion. > > -Sandy > > On Thu, May 28, 2015 at 11:48 AM, Corey Nolet wrote: > >> I am submitting jobs to my yar

yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm noticing the jvm that fires up to allocate the resources, etc... is not going away after the application master and executors have been allocated. Instead, it just sits there printing 1 second status updates to the console. I

[jira] [Commented] (ACCUMULO-1444) Single Node Accumulo to start the tracer

2015-05-26 Thread Corey Nolet (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-1444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14560353#comment-14560353 ] Corey Nolet commented on ACCUMULO-1444: --- My apologies for the late r

Re: Jr. to Mid Level Big Data jobs in Bay Area

2015-05-17 Thread Corey Nolet
Agreed. Apache user lists archive questions and answers specifically for the purpose of helping the larger community navigate its projects. It is not a place for classifieds and employment information. On Sun, May 17, 2015 at 9:24 PM, Billy Watson wrote: > Uh, it's not about being tolerant. It'

Re: KafkaServer in integration test not properly assigning to leaders to partitions

2015-05-14 Thread Corey Nolet
beginning of it that are making it unparseable once pulled from zookeeper. Any ideas to what this could be? I'm using 0.8.2.0- this is really what's holding me back right now from getting my tests functional. On Thu, May 14, 2015 at 4:29 PM, Corey Nolet wrote: > I raised the log le

Re: KafkaServer in integration test not properly assigning to leaders to partitions

2015-05-14 Thread Corey Nolet
Json encoded blob definitely appears to be going in as a json string. The partition assignment json seems to be the only thing that is being prefixed by these bytes. Any ideas? On Thu, May 14, 2015 at 5:17 PM, Corey Nolet wrote: > I think I figured out what the problem is, though I'm

Re: KafkaServer in integration test not properly assigning to leaders to partitions

2015-05-14 Thread Corey Nolet
t it needs to add the partition assignment to zookeeper with itself as the leader but it's strange that the log messages above seem like they are missing the data. "New topic creation callback for " seems like it should be listing a topic and not blank. Any ideas? On Thu, May 14, 2015

KafkaServer in integration test not properly assigning to leaders to partitions

2015-05-14 Thread Corey Nolet
I'm firing up a KafkaServer (using some EmbeddedKafkaBroker code that I found on Github) so that I can run an end-to-end test ingesting data through a kafka topic with consumers in Spark Streaming pushing to Accumulo. Thus far, my code is doing this: 1) Creating a MiniAccumuloCluster and KafkaSer

Re: 1.5.3 and 1.6.3

2015-05-12 Thread Corey Nolet
That is, unless any of the new committers would like to take it on- in that case, I can help ;-) On Tue, May 12, 2015 at 3:41 PM, Corey Nolet wrote: > I can get a 1.6.3 together. > > > On Tue, May 12, 2015 at 2:04 PM, Christopher wrote: > >> Sure, we can discuss that sep

Re: 1.5.3 and 1.6.3

2015-05-12 Thread Corey Nolet
I can get a 1.6.3 together. On Tue, May 12, 2015 at 2:04 PM, Christopher wrote: > Sure, we can discuss that separately. I'll start a new thread. > > -- > Christopher L Tubbs II > http://gravatar.com/ctubbsii > > > On Tue, May 12, 2015 at 1:58 PM, Sean Busbey wrote: > > let's please have a label

Re: Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
It does look like the function that's executed is in the driver, so doing an Await.result() on a thread AFTER I've executed an action should work. Just updating this here in case anyone has this question in the future. Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD cal

Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD call. After the output format runs, I want to do some directory cleanup and I want to block while I'm doing that. Is that something I can do inside of this function? If not, where would I accomplish this on every micr
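Since foreachRDD functions run on the driver, blocking cleanup inside the function simply holds up the next micro-batch; a PySpark-flavored sketch of that shape (output path and cleanup step are placeholders):

    import shutil
    import time
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="blocking-foreachrdd-sketch")
    ssc = StreamingContext(sc, 10)   # 10-second batches
    stream = ssc.socketTextStream("localhost", 9999)

    def write_and_clean(rdd):
        # Runs on the driver once per batch; the save and the cleanup both finish
        # before the next micro-batch's output is handled.
        if rdd.isEmpty():
            return
        out_dir = "/tmp/stream-out/batch-%d" % int(time.time())       # placeholder output path
        rdd.saveAsTextFile(out_dir)
        shutil.rmtree("/tmp/stream-out/scratch", ignore_errors=True)  # placeholder cleanup

    stream.foreachRDD(write_and_clean)
    ssc.start()
    ssc.awaitTermination()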

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Corey Nolet
A tad off topic, but could still be relevant. Accumulo's design is a tad different in the realm of being able to shard and perform set intersections/unions server-side (through seeks). I've got an adapter for Spark SQL on top of a document store implementation in Accumulo that accepts the push-dow

Re: Running boolean or queries on accumulo

2015-04-30 Thread Corey Nolet
Vaibhav, The difference in an OR iterator is that you will want it to return a single key for all of the given OR terms so that the iterator in the stack above it would see it was a single "hit". It's essentially a merge at the key level to stop duplicate results from being returned (thus appearin

Re: Q4A Project

2015-04-27 Thread Corey Nolet
I'm always looking for places to help out and integrate/share designs & ideas. I look forward to chatting with you about Q4A at the hackathon tomorrow! Have you, by chance, seen the Spark SQL adapter for the Accumulo Recipes Event & Entity Stores [1]? At the very least, it's a good example of usin

Re: Q4A Project

2015-04-27 Thread Corey Nolet
Andrew, Have you considered leveraging existing SQL query layers like Hive or Spark's SQL/DataFrames API? There are some pretty massive optimizations involved in that API making the push-down predicates / selections pretty easy to adapt for Accumulo. On Mon, Apr 27, 2015 at 8:37 PM, Andrew Wells

Re: DAG

2015-04-25 Thread Corey Nolet
Giovanni, The DAG can be walked by calling the "dependencies()" function on any RDD. It returns a Seq containing the parent RDDs. If you start at the leaves and walk through the parents until dependencies() returns an empty Seq, you ultimately have your DAG. On Sat, Apr 25, 2015 at 1:28 PM, Akhi

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Corey Nolet
If you return an iterable, you are not tying the API to a CompactBuffer. Someday, the data could be fetched lazily and the API would not have to change. On Apr 23, 2015 6:59 PM, "Dean Wampler" wrote: > I wasn't involved in this decision ("I just make the fries"), but > CompactBuffer is designed fo

Horizontal scaling a topic

2015-04-23 Thread Corey Nolet
I have a cluster of 3 nodes and I've created a topic with some number of partitions and some number of replicas, let's say 10 and 2, respectively. Later, after I've got my 3 nodes fairly consumed with data in the 10 partitions, I want to add 2 more nodes to the mix to help balance out the partition

Re: Streaming anomaly detection using ARIMA

2015-04-10 Thread Corey Nolet
this with Spark Streaming but imagine it would also > work. Have you tried this? > > Within a window you would probably take the first x% as training and > the rest as test. I don't think there's a question of looking across > windows. > > On Thu, Apr 2, 2015 at 12:

SparkR newHadoopAPIRDD

2015-04-01 Thread Corey Nolet
How hard would it be to expose this in some way? I ask because the current textFile and objectFile functions are obviously at some point calling out to a FileInputFormat and configuring it. Could we get a way to configure any arbitrary inputformat / outputformat?

Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
used Scalation for ARIMA models? On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet wrote: > Taking out the complexity of the ARIMA models to simplify things- I can't > seem to find a good way to represent even standard moving averages in spark > streaming. Perhaps it's my ignorance w

Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API. On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet w
