Streaming anomaly detection using ARIMA

2015-03-27 Thread Corey Nolet
I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform light anomaly detection. The time series data is going to be bucketed into different time units (several minutes within several hours, several hours within several days, several days within several

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Corey Nolet
Spark uses a SerializableWritable [1] to Java-serialize Writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular Java serialization. Though it is wrapping the actual AccumuloInputFormat (another example of something you may

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Corey Nolet
I would use a sum of squares. This would allow you to keep an ongoing value as an associative operation (in an aggregator) and then calculate the variance & std deviation after the fact. On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang wrote: > Hi, > > > > I have a DataFrame object and I want to do types
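A minimal sketch of that approach, assuming the values are available as an RDD[Double]: one pass keeps (count, sum, sum of squares) as an associative aggregate, and the variance and standard deviation fall out afterwards.

```scala
import org.apache.spark.rdd.RDD

// One associative pass over the data: (count, sum, sum of squares).
def stddev(rdd: RDD[Double]): Double = {
  val (n, sum, sumSq) = rdd.aggregate((0L, 0.0, 0.0))(
    (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
    (a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3)
  )
  val mean = sum / n
  math.sqrt(sumSq / n - mean * mean) // population variance -> stddev
}
```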

StreamingListener

2015-03-11 Thread Corey Nolet
Given the following scenario: dstream.map(...).filter(...).window(...).foreachrdd() When would the onBatchCompleted fire?
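For reference, a hedged sketch of hooking into batch completion with the streaming listener API (the logging body is illustrative): a listener registered this way receives a callback after each batch's output operations have run.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Log a line every time a batch finishes processing.
def logBatches(ssc: StreamingContext): Unit = {
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
      println(s"batch done, processing delay = ${batch.batchInfo.processingDelay}")
    }
  })
}
```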

Re: [VOTE] Establishing a contrib repo for upgrade testing

2015-03-10 Thread Corey Nolet
+1 On Tue, Mar 10, 2015 at 10:57 AM, David Medinets wrote: > +1 > > On Tue, Mar 10, 2015 at 10:56 AM, Adam Fuchs wrote: > > +1 > > > > Adam > > On Mar 10, 2015 2:48 AM, "Sean Busbey" wrote: > > > >> Hi Accumulo! > >> > >> This is the VOTE thread following our DISCUSS thread on establishing a >

Re: Batching at the socket layer

2015-03-10 Thread Corey Nolet
batching in new producer is > per topic partition; the batch size is controlled by both max batch > size and linger time config. > > Jiangjie (Becket) Qin > > On 3/9/15, 10:10 AM, "Corey Nolet" wrote: > > >I'm curious what type of batching Kafka prod
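For reference, the two 0.8.2 "new" producer knobs named in that reply, with illustrative values:

```scala
import java.util.Properties

// New-producer batching configs (values are illustrative, not recommendations).
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("batch.size", "16384") // max bytes buffered per topic-partition batch
props.put("linger.ms", "5")      // how long to wait for a batch to fill before sending
```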

Re: [VOTE] Release Apache Spark 1.3.0 (RC3)

2015-03-09 Thread Corey Nolet
+1 (non-binding) - Verified signatures - Built on Mac OS X and Fedora 21. On Mon, Mar 9, 2015 at 11:01 PM, Krishna Sankar wrote: > Excellent, Thanks Xiangrui. The mystery is solved. > Cheers > > > > On Mon, Mar 9, 2015 at 3:30 PM, Xiangrui Meng wrote: > > > Krishna, I tested your linear regre

Fwd: Versioning

2015-03-09 Thread Corey Nolet
I'm new to Kafka and I'm trying to understand the version semantics. We want to use Kafka w/ Spark but our version of Spark is tied to 0.8.0. We were wondering what guarantees are made about backwards compatibility across 0.8.x.x. At first glance, given the 3 digits used for versions, I figured 0.8.

Fwd: Batching at the socket layer

2015-03-09 Thread Corey Nolet
I'm curious what type of batching Kafka producers do at the socket layer. For instance, if I have a partitioner that round-robins n messages to a different partition, am I guaranteed to get n different messages sent over the socket or is there some micro-batching going on underneath? I am trying

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
Thanks for taking this on Ted! On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu wrote: > I have created SPARK-6085 with pull request: > https://github.com/apache/spark/pull/4836 > > Cheers > > On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet wrote: > >> +1 to a better def

Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
me-consuming jobs. Imagine if there was an > automatic partition reconfiguration function that automagically did that... > > > On Tue, Feb 24, 2015 at 3:20 AM, Corey Nolet wrote: > >> I *think* this may have been related to the default memory overhead >> setting being too lo

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
+1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple of days of digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at 11:
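For readers landing here, this is the setting in question (the value shown is illustrative, not a recommendation):

```scala
import org.apache.spark.SparkConf

// Off-heap headroom YARN grants each executor beyond spark.executor.memory.
val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "1024") // MB
```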

Re: Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
tively be listening to a > partition. > > Yes, my understanding is that multiple receivers in one group are the > way to consume a topic's partitions in parallel. > > On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet wrote: > > Looking @ [1], it seems to recommend pull f

Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
Looking @ [1], it seems to recommend pulling from multiple Kafka topics in order to parallelize data received from Kafka over multiple nodes. I notice in [2], however, that one of the createConsumer() functions takes a groupId. So am I understanding correctly that creating multiple DStreams with the s
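A sketch of the pattern under discussion, with zkQuorum/groupId/topic as placeholders: several receivers in the same consumer group, unioned back into one DStream.

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Each createStream call stands up its own receiver; Kafka balances the
// topic's partitions across receivers in the same consumer group.
def parallelKafkaStream(ssc: StreamingContext, zkQuorum: String,
                        groupId: String, topic: String, numReceivers: Int) = {
  val streams = (1 to numReceivers).map { _ =>
    KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> 1))
  }
  ssc.union(streams)
}
```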

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
:31 AM, Zhan Zhang > wrote: > > Currently in spark, it looks like there is no easy way to know the > > dependencies. It is solved at run time. > > > > Thanks. > > > > Zhan Zhang > > > > On Feb 26, 2015, at 4:20 PM, Corey Nolet wrote: > > > > Ted. That one I know. It was the dependency part I was curious about >

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
xt has this method: >* Return information about what RDDs are cached, if they are in mem or > on disk, how much space >* they take, etc. >*/ > @DeveloperApi > def getRDDStorageInfo: Array[RDDInfo] = { > > Cheers > > On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
.map().() > rdd1.count > future { rdd1.saveAsHadoopFile(...) } > future { rdd2.saveAsHadoopFile(…) } > > In this way, rdd1 will be calculated once, and two saveAsHadoopFile will > happen concurrently. > > Thanks. > > Zhan Zhang > > > > On Feb 26, 2015
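A cleaned-up, self-contained sketch of the quoted pattern (paths are placeholders, and saveAsTextFile stands in for saveAsHadoopFile to keep it short):

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.SparkContext

def saveBothOutputs(sc: SparkContext): Unit = {
  val rdd1 = sc.textFile("/in").map(_.trim).cache()
  rdd1.count() // materialize rdd1 exactly once
  val rdd2 = rdd1.map(_.toLowerCase)
  // Both actions run concurrently; each reads the cached rdd1 rather than
  // recomputing its lineage.
  Future { rdd1.saveAsTextFile("/out1") }
  Future { rdd2.saveAsTextFile("/out2") }
}
```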

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
d be the behavior myself and all my coworkers expected. On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet wrote: > I should probably mention that my example case is much over-simplified: > Let's say I've got a tree, a fairly complex one where I begin a series of > jobs at the ro

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
partition of rdd1 even when the rest is ready. > > That is probably usually a good idea in almost all cases. That much, I > don't know how hard it is to implement. But I speculate that it's > easier to deal with it at that level than as a function of the > dependency gr

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
and trigger the execution > if there is no shuffle dependencies in between RDDs. > > Thanks. > > Zhan Zhang > On Feb 26, 2015, at 1:28 PM, Corey Nolet wrote: > > > Let's say I'm given 2 RDDs and told to store them in a sequence file and > they have the fo

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I see the "rdd.dependencies()" function, does that include ALL the dependencies of an RDD? Is it safe to assume I can say "rdd2.dependencies.contains(rdd1)"? On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet wrote: > Let's say I'm given 2 RDDs and told to store t

How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1 = sparkContext.sequenceFile().cache() val rdd2 = rdd1.map() How would I tell, programmatically, without being the one who built rdd1 and rdd2, whether or not rdd2 depend
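One hedged way to answer this: dependencies only lists an RDD's direct parents, so walk the lineage recursively. Note this compares by reference identity, and lineage visible this way ends at checkpoint boundaries.

```scala
import org.apache.spark.rdd.RDD

// True if `ancestor` appears anywhere in rdd's lineage.
def dependsOn(rdd: RDD[_], ancestor: RDD[_]): Boolean =
  rdd.dependencies.exists { dep =>
    (dep.rdd eq ancestor) || dependsOn(dep.rdd, ancestor)
  }

// Usage for the example above: dependsOn(rdd2, rdd1)
```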

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
ll see tomorrow, but I have a suspicion this may have been the cause of the executors being killed by the application master. On Feb 23, 2015 5:25 PM, "Corey Nolet" wrote: > I've got the opposite problem with regards to partitioning. I've got over > 6000 partitions for s

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
> too few in the beginning, the problems seem to decrease. Also, increasing > spark.akka.askTimeout and spark.core.connection.ack.wait.timeout > significantly (~700 secs), the problems seem to almost disappear. Don't > want to celebrate yet, still a long way left before the job complet

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
>> fraction of the Executor heap will be used for your user code vs the >> shuffle vs RDD caching with the spark.storage.memoryFraction setting. >> >> On Sat, Feb 21, 2015 at 2:58 PM, Petar Zecevic >> wrote: >> >>> >>> Could you try to

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
x Parquet filter push-down > SPARK-5310 SPARK-5166 Update SQL programming guide for 1.3 > SPARK-5183 SPARK-5180 Document data source API > SPARK-3650 Triangle Count handles reverse edges incorrectly > SPARK-3511 Create a RELEASE-NOTES.txt file in the repo > > > On Mon, Feb 23, 20

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted (other than the implicit vote from Patrick). Was there a discussion offline to cut an RC2? Was the vote extended? On Mon, Feb 23, 2015 at 6:59 AM, Robin East wrote: > Running ec2 launch scripts gives me the following err

Re: Missing shuffle files

2015-02-21 Thread Corey Nolet
I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a s

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Corey Nolet
+1 (non-binding) - Verified signatures using [1] - Built on Mac OS X Yosemite - Built on Fedora 21 Each build was run against Hadoop 2.4 with the yarn, hive, and hive-thriftserver profiles. I am having trouble getting all the tests passing on a single run on both machines but we have this same

[ANNOUNCE] Apache Accumulo 1.6.2 Released

2015-02-19 Thread Corey Nolet
The Apache Accumulo project is happy to announce its 1.6.2 release. Version 1.6.2 is the most recent bug-fix release in its 1.6.x release line. This version includes numerous bug fixes as well as a performance improvement over previous versions. Existing users of 1.6.x are encouraged to upgrade to

Fwd: [ANNOUNCE] Apache Accumulo 1.6.2 Released

2015-02-18 Thread Corey Nolet
Forwarding to dev. -- Forwarded message -- From: Corey Nolet Date: Wed, Feb 18, 2015 at 12:25 PM Subject: [ANNOUNCE] Apache Accumulo 1.6.2 Released To: u...@accumulo.apache.org, annou...@apache.org The Apache Accumulo project is happy to announce its 1.6.2 release. Version

[ANNOUNCE] Apache Accumulo 1.6.2 Released

2015-02-18 Thread Corey Nolet
The Apache Accumulo project is happy to announce its 1.6.2 release. Version 1.6.2 is the most recent bug-fix release in its 1.6.x release line. This version includes numerous bug fixes as well as a performance improvement over previous versions. Existing users of 1.6.x are encouraged to upgrade to

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-18 Thread Corey Nolet
k we're all good. > > > Keith Turner wrote: > >> Corey thanks for doing this release. I took a look at the release notes >> on >> staging, looks good. >> >> >> >> On Wed, Feb 11, 2015 at 8:52 AM, Corey Nolet wrote: >> >>

Re: Replacing Jetty with TomCat

2015-02-17 Thread Corey Nolet
Niranda, I'm not sure if I'd say Spark's use of Jetty to expose its UI monitoring layer constitutes a use of "two web servers in a single product". Hadoop uses Jetty as well as do many other applications today that need embedded http layers for serving up their monitoring UI to users. This is comp

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Corey Nolet
We've been using Commons Configuration to pull our properties out of properties files and system properties (prioritizing system properties over the others), and we add those properties to our SparkConf explicitly. We use ArgoPartser to get the command line argument for which property file to load.
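A rough sketch of that setup under stated assumptions (Commons Configuration 1.x, properties file name passed in): system properties override the file, and everything is copied into the SparkConf explicitly.

```scala
import scala.collection.JavaConverters._
import org.apache.commons.configuration.{CompositeConfiguration, PropertiesConfiguration, SystemConfiguration}
import org.apache.spark.SparkConf

def buildConf(propsFile: String): SparkConf = {
  val composite = new CompositeConfiguration()
  composite.addConfiguration(new SystemConfiguration())          // wins on conflicts
  composite.addConfiguration(new PropertiesConfiguration(propsFile))
  val conf = new SparkConf()
  composite.getKeys.asScala.foreach { key =>
    conf.set(key.toString, composite.getString(key.toString))
  }
  conf
}
```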

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-15 Thread Corey Nolet
Billie took on the user manual last time. I'm still not sure how to build the website output for that. On Sun, Feb 15, 2015 at 8:58 AM, Corey Nolet wrote: > Josh- I'm terribly busy this weekend but I am going to tackle the release > notes, publishing the artifacts to the websi

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-15 Thread Corey Nolet
sh Elser wrote: > Great work, Corey! > > What else do we need to do? Release notes? Do you have the > javadoc/artifact deployments under control? > > > Corey Nolet wrote: > >> The vote is now closed. The release of Apache Accumulo 1.6.2 RC5 has been >> accepted wi

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-14 Thread Corey Nolet
n 1.6.2. Because of ACCUMULO-3597, I was not > able to get a long randomwalk run. The bug happened shortly after > starting the test. I killed the deadlocked tserver and everything started > running again. > > > > On Wed, Feb 11, 2015 at 8:52 AM, Corey Nolet wrote: >

Re: Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Nevermind- I think I may have had a schema-related issue (sometimes booleans were represented as strings and sometimes as raw booleans, but when I populated the schema, one or the other was chosen). On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet wrote: > Here are the results of a few different

Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Here are the results of a few different SQL strings (let's assume the schemas are valid for the data types used): SELECT * from myTable where key1 = true -> no filters are pushed to my PrunedFilteredScan SELECT * from myTable where key1 = true and key2 = 5 -> 1 filter (key2) is pushed to my Prune
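For context, a hedged sketch of how pushed-down equality filters are typically unpacked inside a PrunedFilteredScan's buildScan; the report above is that the bare boolean predicate never arrives at all.

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter}

// Collect attribute -> value pairs from whatever equality filters arrive.
// Filters are advisory, so anything unmatched can simply be ignored.
def equalityFilters(filters: Array[Filter]): Map[String, Any] =
  filters.collect { case EqualTo(attribute, value) => attribute -> value }.toMap
```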

Re: SparkSQL doesn't seem to like "$"'s in column names

2015-02-13 Thread Corey Nolet
This doesn't seem to have helped. On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust wrote: > Try using `backticks` to escape non-standard characters. > > On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet wrote: >> I don't remember Oracle ever enforcing that I couldn't

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-13 Thread Corey Nolet
72 hours after the time at which the RC5 was announced, which was 2pm UTC on Wednesday, February 11th. That would make the vote close on Saturday, February 14th at 2pm UTC (9am EST, 6am PT). On Fri, Feb 13, 2015 at 1:38 PM, Corey Nolet wrote: > Thanks Josh for your verification. Just a reminder tha

SparkSQL doesn't seem to like "$"'s in column names

2015-02-13 Thread Corey Nolet
I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I have a "SELECT * from myTable WHERE locations_$homeAddress = '123 Elm St'" It's telling me $ is invalid. Is this a bug?
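The workaround suggested in the reply above (reportedly without success in this case), assuming a sqlContext in scope:

```scala
// Escape the non-standard column name with backticks.
sqlContext.sql("SELECT * FROM myTable WHERE `locations_$homeAddress` = '123 Elm St'")
```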

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-13 Thread Corey Nolet
* Verified NOTICE in native.tar.gz > > > Corey Nolet wrote: > >>Devs, >> >> Please consider the following candidate for Apache Accumulo 1.6.2 >> >> Branch: 1.6.2-rc5 >> SHA1: 42943a1817434f1f32e9f0224941aa2fff162e74 >>

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
Ok. I just verified that this is the case with a little test: WHERE (a = 'v1' and b = 'v2') -> PrunedFilteredScan passes down 2 filters. WHERE (a = 'v1' and b = 'v2') or (a = 'v3') -> PrunedFilteredScan passes down 0 filters. On Fri, Feb 13, 2015

Re: Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
tDate).toDate > }.getOrElse() > val end = filters.find { > case LessThan("end", endDate: String) => DateTime.parse(endDate).toDate > }.getOrElse() > > ... > > Filters are advisory, so you can ignore ones that aren't start/end. > > Michael > > On

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set that I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current implementation
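A fleshed-out sketch of the advice quoted in the replies, with the "start"/"end" attribute names and the date format as assumptions: pull the temporal bounds out of the pushed-down filters and ignore the rest, since filters are advisory.

```scala
import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.sources.{Filter, GreaterThan, LessThan}

// Extract optional start/end bounds from whatever filters were pushed down.
def temporalBounds(filters: Array[Filter]): (Option[Date], Option[Date]) = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val start = filters.collectFirst { case GreaterThan("start", s: String) => fmt.parse(s) }
  val end   = filters.collectFirst { case LessThan("end", e: String)      => fmt.parse(e) }
  (start, end)
}
```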

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
ng all the data to a single partition (no matter what window I set) and it seems to lock up my jobs. I waited for 15 minutes for a stage that usually takes about 15 seconds and I finally just killed the job in yarn. On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet wrote: > So I tried this: > >

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
group should need to fit. > > On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet wrote: > >> Doesn't iter still need to fit entirely into memory? >> >> On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra >> wrote: >> >>> rdd.mapPartitions { iter =

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the "spark.kryo.registrator" property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet wrote: > I'm trying to register a custom class that extends Kryo's Serializer > interface. I can
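A sketch of the working approach described above; MyEvent and MyEventSerializer are placeholders for the custom class and its Kryo Serializer implementation.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// MyEvent / MyEventSerializer are placeholders, not real classes.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    kryo.register(classOf[MyEvent], new MyEventSerializer)
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")
```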

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

Re: Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra wrote: > rdd.mapPartitions { iter => > val grouped = iter.grouped(batchSize) > for (group <- grouped) { ... } > } > > On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet

Easy way to "partition" an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word "partition" here is a tad different than the term "partition" that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, If I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

[VOTE] Apache Accumulo 1.6.2 RC5

2015-02-11 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc5 SHA1: 42943a1817434f1f32e9f0224941aa2fff162e74 Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1024/ Source tarball: https://repository.apache.

Re: [VOTE] Apache Accumulo 1.6.2 RC4

2015-02-10 Thread Corey Nolet
> w/ agitation, ran for 26 hrs and wrote 21 billion entries. > > <https://issues.apache.org/jira/browse/ACCUMULO-3576> > > On Thu, Feb 5, 2015 at 11:00 PM, Corey Nolet wrote: > > > Devs, > > > > Please consider the fo

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet wrote: > I am using an input format to load data from Accumulo [1] in to a Spark > RD

Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am using an input format to load data from Accumulo [1] in to a Spark RDD. It looks like something is happening in the serialization of my output writable between the time it is emitted from the InputFormat and the time it reaches its destination on the driver. What's happening is that the resul

Re: [VOTE] Apache Accumulo 1.6.2 RC4

2015-02-06 Thread Corey Nolet
e included in RC4. > > > -- > Christopher L Tubbs II > http://gravatar.com/ctubbsii > > On Thu, Feb 5, 2015 at 11:00 PM, Corey Nolet wrote: > > > Devs, > > > > Please consider the following candidate for Apache Accumulo 1.6.2 > > > > B

[VOTE] Apache Accumulo 1.6.2 RC4

2015-02-05 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc4 SHA1: 0649982c2e395852ce2e4408d283a40d6490a980 Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1022/ Source tarball: https://repository.apache.

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common Java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke wrote: > If you want to design something like Spark shell have a look at: > > http://zeppelin-project.

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-02-04 Thread Corey Nolet
our effort. > > -Eric > > On Fri, Jan 30, 2015 at 10:36 AM, Keith Turner wrote: > > > On Thu, Jan 29, 2015 at 7:27 PM, Corey Nolet wrote: > > > > > > However I am seeing ACCUMULO-3545[1] that > > > I need to investigate. > > > > > >

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
My mistake Marcelo, I was looking at the wrong message. That reply was meant for Bo Yang. On Feb 4, 2015 4:02 PM, "Marcelo Vanzin" wrote: > Hi Corey, > > On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet wrote: >> Another suggestion is to build Spark by yourself. >

[jira] [Commented] (ACCUMULO-3549) tablet server location cache may grow too large

2015-02-04 Thread Corey Nolet (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305913#comment-14305913 ] Corey Nolet commented on ACCUMULO-3549: --- So we're comfortable with th

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
ith Spark 1.1 and earlier you'd get >> Guava 14 from Spark, so still a problem for you). >> >> Right now, the option Markus mentioned >> (spark.yarn.user.classpath.first) can be a workaround for you, since >> it will place your app's jars before Yarn's on the classpath. >> >

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-02-04 Thread Corey Nolet
replaced volume does not appear in instance.volumes Thanks, Corey Nolet

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
.org/jira/browse/SPARK-2996 - only works for YARN). >> Also thread at >> http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html >> . >> >> HTH, >> Markus >> >> On 02/03/2015 11:20 PM, Corey Nolet wrot

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now between the Guava version in my Spark application on YARN and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0) while it appears the Spark executors that are working on my R
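The workaround named in the replies, as a sketch (a 1.2-era YARN option; the classpath-precedence settings have shifted across Spark versions):

```scala
import org.apache.spark.SparkConf

// Place the application's jars ahead of the cluster's on the executor classpath.
val conf = new SparkConf()
  .set("spark.yarn.user.classpath.first", "true")
```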

Re: Welcoming three new committers

2015-02-03 Thread Corey Nolet
Congrats guys! On Tue, Feb 3, 2015 at 7:01 PM, Evan Chan wrote: > Congrats everyone!!! > > On Tue, Feb 3, 2015 at 3:17 PM, Timothy Chen wrote: > > Congrats all! > > > > Tim > > > > > >> On Feb 4, 2015, at 7:10 AM, Pritish Nawlakhe < > prit...@nirvana-international.com> wrote: > >> > >> Congrats

Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of Spark jobs which run in succession over various cached datasets, do small groupings and transforms, and then call saveAsSequenceFile() on them. Each call to save a sequence file appears to have done its work; the task says it completed in "xxx.x seconds" but then it pauses

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-01-29 Thread Corey Nolet
tests. I had one IT that failed on me from the source > build which we can fix later -- things are looking good otherwise from my > testing. > > Thanks for working through this Corey, and Keith for finding bugs :) > > > Corey Nolet wrote: > >>Devs, >> >>

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
e/src/main/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet wrote: > I'm looking @ the ShuffledRDD code and it looks like there is a method > setKeyOrdering()- is this guaranteed to order everything in the partition? > I'm on S

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-01-28 Thread Corey Nolet
<https://mail.google.com/mail/?view=cm&fs=1&tf=1&to=ctubb...@apache.org > >> > > wrote: > > > Does it matter that this was built with Java 1.7.0_25? Is that going to > > > cause issues running in a 1.6 JRE? > > > > > > > > > --

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-01-28 Thread Corey Nolet
I'll start on an RC4 but leave this open for a while in case any more issues like this pop up. On Jan 28, 2015 5:24 PM, "Keith Turner" wrote: > -1 because of ACCUMULO-3541 > > On Wed, Jan 28, 2015 at 2:38 AM, Corey Nolet wrote: > > Devs, >

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet wrote: > In all of the solutions I've found thus far, sorting h

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
y-spark-one-spark-job > > On Wed, Jan 28, 2015 at 12:51 AM, Corey Nolet wrote: > >> I need to be able to take an input RDD[Map[String,Any]] and split it into >> several different RDDs based on some partitionable piece of the key >> (groups) and then send each partition to

[VOTE] Apache Accumulo 1.6.2 RC3

2015-01-27 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc3 SHA1: 3a6987470c1e5090a2ca159614a80f0fa50393bf Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1021/ Source tarball: https://repository.apache.

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
51 AM, Corey Nolet wrote: > I need to be able to take an input RDD[Map[String,Any]] and split it into > several different RDDs based on some partitionable piece of the key > (groups) and then send each partition to a separate set of files in > different folders in HDFS. > > 1

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the
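One common shape of an answer to this, sketched with assumed String key/value types: partition by the group key, then let a Hadoop MultipleTextOutputFormat route each key to its own subdirectory, mirroring MapReduce multiple outputs.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Route each key (group) to its own subdirectory under the output path.
class GroupedOutputFormat extends MultipleTextOutputFormat[Text, Text] {
  override def generateFileNameForKeyValue(key: Text, value: Text, name: String): String =
    key.toString + "/" + name
}

def saveByGroup(rdd: RDD[(String, String)], path: String, partitions: Int): Unit =
  rdd.map { case (group, record) => (new Text(group), new Text(record)) }
    .partitionBy(new HashPartitioner(partitions))
    .saveAsHadoopFile(path, classOf[Text], classOf[Text], classOf[GroupedOutputFormat])
```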

Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-01-27 Thread Corey Nolet
I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling it in YARN. I see classes for it in the 1.2.0 source on GitHub and a property "spark.shuffle.service.enabled" in the spark-defaults.conf. The code mentions t
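A sketch of the 1.2-era setup as I understand it: the Spark side is a flag, while YARN itself must also load the auxiliary service (spark_shuffle -> org.apache.spark.network.yarn.YarnShuffleService in yarn-site.xml, with the shuffle jar on each NodeManager's classpath).

```scala
import org.apache.spark.SparkConf

// Spark-side configuration only; the NodeManager-side aux service is
// configured separately in yarn-site.xml as noted above.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true") // the usual reason to enable it
```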

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
ting --- Basic build with unit tests. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
ting --- Basic build with unit tests. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
ests. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
ests. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
ttps://reviews.apache.org/r/30280/diff/ Testing --- Basic build with unit tests. Thanks, Corey Nolet

Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
mulo/core/util/HadoopCompatUtil.java PRE-CREATION examples/simple/src/main/java/org/apache/accumulo/examples/simple/mapreduce/TeraSortIngest.java 1b8cbaf Diff: https://reviews.apache.org/r/30280/diff/ Testing --- Basic build with unit tests. Thanks, Corey Nolet

Re: Review Request 30252: ACCUMULO-3531 update japi-compliance-check configs.

2015-01-26 Thread Corey Nolet
I believe Josh just committed a fix for the missing license header. On Mon, Jan 26, 2015 at 1:24 PM, Mike Drob wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/30252/#review69636 >

Re: [VOTE] Apache Accumulo 1.6.2 RC2

2015-01-25 Thread Corey Nolet
Christopher, I see what I did in regards to the commit hash- I based the rc2 branch off of the branch I ran the maven release plugin from instead of basing it off the tag which was created. On Sun, Jan 25, 2015 at 3:38 PM, Corey Nolet wrote: > Forwarding discussions to dev. > On Jan 25

Re: [VOTE] Apache Accumulo 1.6.2 RC2

2015-01-25 Thread Corey Nolet
Forwarding discussions to dev. On Jan 25, 2015 3:22 PM, "Josh Elser" wrote: > plus, I don't think it's valid to call this vote on the user list :) > > Corey Nolet wrote: > >> -1 for backwards compatibility issues described. >> >> -1 >> &

Re: [VOTE] Apache Accumulo 1.6.2 RC2

2015-01-25 Thread Corey Nolet
mulo/1.6.1_to_1.6.2/compat_report.html * 1.6.2 -> 1.6.1 (under a semver patch increment, this should be just as strong an assertion as the reverse) http://people.apache.org/~busbey/compat_reports/accumulo/1.6.2_to_1.6.1/compat_report.html On Fri, Jan 23, 2015 at 8:02 PM, Corey Nolet wrote:

Re: Review Request 30252: ACCUMULO-3531 update japi-compliance-check configs.

2015-01-25 Thread Corey Nolet
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30252/#review69571 --- Ship it! Ship It! - Corey Nolet On Jan. 25, 2015, 9:38 a.m

Re: Review Request 30252: ACCUMULO-3531 update japi-compliance-check configs.

2015-01-25 Thread Corey Nolet
g/r/30252/#comment114283> Good. I'll add this to the release documentation I've been working on. - Corey Nolet On Jan. 25, 2015, 9:38 a.m., Sean Busbey wrote: > > --- > This is an automatically generated e-mail.

[VOTE] Apache Accumulo 1.6.2 RC2

2015-01-23 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc2 SHA1: 34987b4c8b4d896bbf2d26be8e70f70976614c0f Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1020/ Source tarball: https://repository.apache.

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-23 Thread Corey Nolet
5 at 11:56 PM, Josh Elser wrote: > > > I think we used to have instruction lying around that described how to > use > > https://github.com/lvc/japi-compliance-checker (not like that has any > > influence on what Sean used, though :D) > > > > > > Corey Nolet

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-22 Thread Corey Nolet
> On Wed, Jan 21, 2015 at 7:50 PM, Corey Nolet wrote: > > > I did notice something strange reviewing this RC. It appears the > staging > > > repo doesn't have hash files for the detached GPG signatures > (*.asc.md5, > > > *.asc.sha1). That's new.

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-21 Thread Corey Nolet
> I did notice something strange reviewing this RC. It appears the staging > repo doesn't have hash files for the detached GPG signatures (*.asc.md5, > *.asc.sha1). That's new. Did you do something special regarding this, > Corey? Or maybe this is just a change with mvn, or maybe it's a change with

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-21 Thread Corey Nolet
onality to the public API. > > > > nice catch > > -1 > > > > > > On Tue, Jan 20, 2015 at 11:18 PM, Corey Nolet > wrote: > > > > > Devs, > > > > > > Please consider the following candidate for Apache Accum

[SQL] Conflicts in inferred Json Schemas

2015-01-21 Thread Corey Nolet
Let's say I have 2 formats for json objects in the same file: schema1 = { "location": "12345 My Lane" } schema2 = { "location":{"houseAddres":"1234 My Lane"} } From my tests, it looks like the current inferSchema() function will end up with only StructField("location", StringType). What would be

[VOTE] Apache Accumulo 1.6.2 RC1

2015-01-20 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc1 SHA1: 533d93adb17e8b27c5243c97209796f66c6b8b2d Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1018/ Source tarball: https://repository.apach

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-19 Thread Corey Nolet
lumes' is called and a replaced volume appears in instance.volumes. Also verified that the error does not appear when 'bin/accumulo init --add-volumes' is called and the replaced volume does not appear in instance.volumes Thanks, Corey Nolet

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-19 Thread Corey Nolet
--------- On Jan. 16, 2015, 5:06 a.m., Corey Nolet wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/29959/ >
