Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
Thanks for taking this on, Ted! On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote: I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
+1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple of days of digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at
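
A minimal sketch of the fix this thread converged on, assuming Spark 1.x on YARN; the 1024 here is illustrative, not a recommended default:

    // raise the off-heap headroom YARN reserves per executor (value in MB)
    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")

The same value can also be passed at submit time via spark-submit --conf spark.yarn.executor.memoryOverhead=1024.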

Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
if there was an automatic partition reconfiguration function that automagically did that... On Tue, Feb 24, 2015 at 3:20 AM, Corey Nolet cjno...@gmail.com wrote: I *think* this may have been related to the default memory overhead setting being too low. I raised the value to 1G and tried my job again

Re: Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
be listening to a partition. Yes, my understanding is that multiple receivers in one group are the way to consume a topic's partitions in parallel. On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet cjno...@gmail.com wrote: Looking @ [1], it seems to recommend pulling from multiple Kafka topics in order

Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
Looking @ [1], it seems to recommend pulling from multiple Kafka topics in order to parallelize data received from Kafka over multiple nodes. I notice in [2], however, that one of the createConsumer() functions takes a groupId. So am I understanding correctly that creating multiple DStreams with the
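
The pattern this thread is asking about, per the Spark Streaming Kafka guide of that era, is several receiver-based streams in the same consumer group, unioned together; a sketch, with the ZooKeeper address, group, and topic names as placeholder assumptions:

    import org.apache.spark.streaming.kafka.KafkaUtils
    // each createStream call stands up one receiver; sharing a groupId splits
    // the topic's partitions across them
    val streams = (1 to 4).map { _ =>
      KafkaUtils.createStream(ssc, "zkhost:2181", "myGroup", Map("myTopic" -> 1))
    }
    val unified = ssc.union(streams)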

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhang zzh...@hortonworks.com wrote: Currently in spark, it looks like there is no easy way to know the dependencies. It is solved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: Ted. That one I know. It was the dependency part I

How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1 = sparkContext.sequenceFile().cache() val rdd2 = rdd1.map() How would I tell, programmatically, without being the one who built rdd1 and rdd2, whether or not rdd2

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I see the rdd.dependencies() function, does that include ALL the dependencies of an RDD? Is it safe to assume I can say rdd2.dependencies.contains(rdd1)? On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file
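
rdd.dependencies only exposes the immediate parents, so a contains-style check misses anything more than one level up the lineage; a small sketch of walking the whole graph instead (reference equality on the RDD objects is the assumption here):

    import org.apache.spark.rdd.RDD
    // true if `target` appears anywhere in rdd's lineage
    def dependsOn(rdd: RDD[_], target: RDD[_]): Boolean =
      rdd.dependencies.exists(d => (d.rdd eq target) || dependsOn(d.rdd, target))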

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
the execution if there is no shuffle dependencies in between RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
be the behavior myself and all my coworkers expected. On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet cjno...@gmail.com wrote: I should probably mention that my example case is much oversimplified. Let's say I've got a tree, a fairly complex one where I begin a series of jobs at the root which

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
in almost all cases. That much, I don't know how hard it is to implement. But I speculate that it's easier to deal with it at that level than as a function of the dependency graph. On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to do the scheduling myself now

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
future { rdd1.saveAsHadoopFile(...) } future { rdd2.saveAsHadoopFile(…) } In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls will happen concurrently. Thanks. Zhan Zhang On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote: What confused me
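
A sketch of the concurrent-save pattern Zhan describes, with saveAsTextFile and illustrative paths standing in for the elided saveAsHadoopFile arguments; rdd1 being cached is what makes it compute only once:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    val f1 = Future { rdd1.saveAsTextFile("hdfs:///out/rdd1") } // illustrative paths
    val f2 = Future { rdd2.saveAsTextFile("hdfs:///out/rdd2") }
    Await.ready(Future.sequence(Seq(f1, f2)), Duration.Inf)     // both jobs run concurrently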

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
: * Return information about what RDDs are cached, if they are in mem or on disk, how much space * they take, etc. */ @DeveloperApi def getRDDStorageInfo: Array[RDDInfo] = { Cheers On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet cjno...@gmail.com wrote: Zhan, This is exactly what I'm
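
A one-line usage of that API for the question at hand; RDDInfo carries the RDD's id, so checking whether rdd1 is currently materialized looks roughly like:

    // true if any partitions of rdd1 are cached in memory or on disk
    val isCached = sc.getRDDStorageInfo.exists(_.id == rdd1.id)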

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
This vote was supposed to close on Saturday but it looks like no PMCs voted (other than the implicit vote from Patrick). Was there a discussion offline to cut an RC2? Was the vote extended? On Mon, Feb 23, 2015 at 6:59 AM, Robin East robin.e...@xense.co.uk wrote: Running ec2 launch scripts

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-23 Thread Corey Nolet
SPARK-5183 SPARK-5180 Document data source API SPARK-3650 Triangle Count handles reverse edges incorrectly SPARK-3511 Create a RELEASE-NOTES.txt file in the repo On Mon, Feb 23, 2015 at 1:55 PM, Corey Nolet cjno...@gmail.com wrote: This vote was supposed to close on Saturday but it looks

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
? spark.shuffle.service.enabled = true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
: Could you try to turn on the external shuffle service? spark.shuffle.service.enabled = true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
- but i have a suspicion this may have been the cause of the executors being killed by the application master. On Feb 23, 2015 5:25 PM, Corey Nolet cjno...@gmail.com wrote: I've got the opposite problem with regards to partitioning. I've got over 6000 partitions for some of these RDDs which

Re: Missing shuffle files

2015-02-21 Thread Corey Nolet
I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a

[ANNOUNCE] Apache Accumulo 1.6.2 Released

2015-02-19 Thread Corey Nolet
The Apache Accumulo project is happy to announce its 1.6.2 release. Version 1.6.2 is the most recent bug-fix release in its 1.6.x release line. This version includes numerous bug fixes as well as a performance improvement over previous versions. Existing users of 1.6.x are encouraged to upgrade

Re: [VOTE] Release Apache Spark 1.3.0 (RC1)

2015-02-19 Thread Corey Nolet
+1 (non-binding) - Verified signatures using [1] - Built on MacOSX Yosemite - Built on Fedora 21 Each build was run against Hadoop 2.4 with the yarn, hive, and hive-thriftserver profiles. I am having trouble getting all the tests passing on a single run on both machines but we have this

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-18 Thread Corey Nolet
Thanks, Keith! Josh deserves credit for the release notes. We'll publish the site and I'll get the announcement together. On Wed, Feb 18, 2015 at 11:34 AM, Josh Elser josh.el...@gmail.com wrote: +1 ditto. Mirrors appear updated as well. I just fixed another s/1.6.1/1.6.2/ on the sidebar. I

[ANNOUNCE] Apache Accumulo 1.6.2 Released

2015-02-18 Thread Corey Nolet
The Apache Accumulo project is happy to announce its 1.6.2 release. Version 1.6.2 is the most recent bug-fix release in its 1.6.x release line. This version includes numerous bug fixes as well as a performance improvement over previous versions. Existing users of 1.6.x are encouraged to upgrade

Fwd: [ANNOUNCE] Apache Accumulo 1.6.2 Released

2015-02-18 Thread Corey Nolet
Forwarding to dev. -- Forwarded message -- From: Corey Nolet cjno...@apache.org Date: Wed, Feb 18, 2015 at 12:25 PM Subject: [ANNOUNCE] Apache Accumulo 1.6.2 Released To: u...@accumulo.apache.org, annou...@apache.org The Apache Accumulo project is happy to announce its 1.6.2

Re: Replacing Jetty with TomCat

2015-02-17 Thread Corey Nolet
Niranda, I'm not sure if I'd say Spark's use of Jetty to expose its UI monitoring layer constitutes a use of two web servers in a single product. Hadoop uses Jetty, as do many other applications today that need embedded HTTP layers for serving up their monitoring UI to users. This is

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Corey Nolet
We've been using commons configuration to pull our properties out of properties files and system properties (prioritizing system properties over others); we add those properties to our SparkConf explicitly, and we use an argument parser to get the command-line argument for which property file to load.
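
A hedged sketch of that layering with Commons Configuration 1.x; the file name and the copy-everything loop are illustrative:

    import scala.collection.JavaConverters._
    import org.apache.commons.configuration.{CompositeConfiguration,
      PropertiesConfiguration, SystemConfiguration}
    import org.apache.spark.SparkConf

    val merged = new CompositeConfiguration()
    merged.addConfiguration(new SystemConfiguration())       // consulted first, so it wins
    merged.addConfiguration(new PropertiesConfiguration("app.properties"))

    val sparkConf = new SparkConf()
    merged.getKeys.asScala.foreach { k =>
      sparkConf.set(k.toString, merged.getString(k.toString))
    }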

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-15 Thread Corey Nolet
...@gmail.com wrote: Great work, Corey! What else do we need to do? Release notes? Do you have the javadoc/artifact deployments under control? Corey Nolet wrote: The vote is now closed. The release of Apache Accumulo 1.6.2 RC5 has been accepted with 3 +1's and 0 -1's. On Fri, Feb 13

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-15 Thread Corey Nolet
Billie took on the user manual last time. I'm still not sure how to build the website output for that. On Sun, Feb 15, 2015 at 8:58 AM, Corey Nolet cjno...@gmail.com wrote: Josh- I'm terribly busy this weekend but I am going to tackle the release notes, publishing the artifacts to the website

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-14 Thread Corey Nolet
Because of ACCUMULO-3597, I was not able to get a long randomwalk run. The bug happened shortly after starting the test. I killed the deadlocked tserver and everything started running again. On Wed, Feb 11, 2015 at 8:52 AM, Corey Nolet cjno...@apache.org wrote: Devs, Please

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-13 Thread Corey Nolet
NOTICE in native.tar.gz Corey Nolet wrote: Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc5 SHA1: 42943a1817434f1f32e9f0224941aa2fff162e74 Staging Repository: https://repository.apache.org/content/repositories

SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I have a SELECT * from myTable WHERE locations_$homeAddress = '123 Elm St'. It's telling me $ is invalid. Is this a bug?

Re: [VOTE] Apache Accumulo 1.6.2 RC5

2015-02-13 Thread Corey Nolet
time at which the RC5 was announced, which was 2pm UTC on Wednesday, February 11th. That would make the vote close on Saturday, February 14th at 2pm UTC (9am EST, 6am PT) On Fri, Feb 13, 2015 at 1:38 PM, Corey Nolet cjno...@gmail.com wrote: Thanks Josh for your verification. Just a reminder

Re: SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
This doesn't seem to have helped. On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust mich...@databricks.com wrote: Try using `backticks` to escape non-standard characters. On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote: I don't remember Oracle ever enforcing that I
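
For reference, the escaped form being suggested, using the table and column from the original question (and reported in this reply as not having helped):

    SELECT * FROM myTable WHERE `locations_$homeAddress` = '123 Elm St'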

Re: Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Nevermind- I think I may have had a schema-related issue (sometimes booleans were represented as strings and sometimes as raw booleans, but when I populated the schema one or the other was chosen). On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet cjno...@gmail.com wrote: Here are the results

Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Here are the results of a few different SQL strings (let's assume the schemas are valid for the data types used): SELECT * from myTable where key1 = true - no filters are pushed to my PrunedFilteredScan SELECT * from myTable where key1 = true and key2 = 5 - 1 filter (key2) is pushed to my

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the spark.kryo.registrator property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class
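
A sketch of the approach that worked, assuming Spark 1.x; the event class, its Kryo Serializer, and the package name are placeholders:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // wire the custom Serializer to the class it handles
        kryo.register(classOf[MyEvent], new MyEventSerializer())
      }
    }

    // conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // conf.set("spark.kryo.registrator", "com.example.MyRegistrator")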

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
the data to a single partition (no matter what window I set) and it seems to lock up my jobs. I waited for 15 minutes for a stage that usually takes about 15 seconds and I finally just killed the job in yarn. On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet cjno...@gmail.com wrote: So I tried

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

[VOTE] Apache Accumulo 1.6.2 RC5

2015-02-11 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc5 SHA1: 42943a1817434f1f32e9f0224941aa2fff162e74 Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1024/ Source tarball:

Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word partition here is a tad different than the term partition that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, if I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize) for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno
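
Cleaning up the snippet quoted above: Iterator.grouped is lazy, so only one batch is materialized at a time rather than the whole partition, which is the answer to the question asked here; processBatch is a placeholder:

    val batchSize = 30
    rdd.mapPartitions { iter =>
      // grouped() pulls batchSize elements at a time; the full iterator
      // never needs to fit in memory
      iter.grouped(batchSize).map(group => processBatch(group))
    }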

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet cjno...@gmail.com wrote: I am using an input format to load data from Accumulo [1] in to a Spark
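
A sketch of that workaround; the EventWritable accessor is an assumption, but the shape is to copy the value out in a map before collect(), since Hadoop record readers reuse Writable instances across records:

    // extract a plain Event per record before collecting
    val events = rdd.map { case (key, eventWritable) => eventWritable.get() }
    events.collect()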

Re: [VOTE] Apache Accumulo 1.6.2 RC4

2015-02-10 Thread Corey Nolet
billion entries. https://issues.apache.org/jira/browse/ACCUMULO-3576 On Thu, Feb 5, 2015 at 11:00 PM, Corey Nolet cjno...@apache.org wrote: Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc4 SHA1

Re: [VOTE] Apache Accumulo 1.6.2 RC4

2015-02-06 Thread Corey Nolet
. -- Christopher L Tubbs II http://gravatar.com/ctubbsii On Thu, Feb 5, 2015 at 11:00 PM, Corey Nolet cjno...@apache.org wrote: Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc4 SHA1: 0649982c2e395852ce2e4408d283a40d6490a980

[VOTE] Apache Accumulo 1.6.2 RC4

2015-02-05 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc4 SHA1: 0649982c2e395852ce2e4408d283a40d6490a980 Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1022/ Source tarball:

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke charles.fed...@gmail.com wrote: If you want to design something like Spark shell have a look at:

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-02-04 Thread Corey Nolet
in instance.volumes Thanks, Corey Nolet

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
My mistake, Marcelo, I was looking at the wrong message. That reply was meant for bo yang. On Feb 4, 2015 4:02 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Corey, On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote: Another suggestion is to build Spark by yourself

[jira] [Commented] (ACCUMULO-3549) tablet server location cache may grow too large

2015-02-04 Thread Corey Nolet (JIRA)
[ https://issues.apache.org/jira/browse/ACCUMULO-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305913#comment-14305913 ] Corey Nolet commented on ACCUMULO-3549: --- So we're comfortable with this change

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
works for YARN). Also thread at http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html . HTH, Markus On 02/03/2015 11:20 PM, Corey Nolet wrote: I'm having a really bad dependency conflict right now with Guava versions between my Spark

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-02-04 Thread Corey Nolet
, 2015 at 10:36 AM, Keith Turner ke...@deenlo.com wrote: On Thu, Jan 29, 2015 at 7:27 PM, Corey Nolet cjno...@gmail.com wrote: However I am seeing ACCUMULO-3545[1] that I need to investigate. Ok. I'll cut another RC as soon as that's complete. Verification completed

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now with Guava versions between my Spark application in Yarn and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0) while it appears the Spark executors that are working on my
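
A hedged sketch of the Spark-side switches this thread lands on (property names as of the 1.2.x era; both were experimental and were reorganized in later releases):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.yarn.user.classpath.first", "true") // prefer user jars on YARN
      .set("spark.files.userClassPathFirst", "true")  // executor-side flag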

Re: Welcoming three new committers

2015-02-03 Thread Corey Nolet
Congrats guys! On Tue, Feb 3, 2015 at 7:01 PM, Evan Chan velvia.git...@gmail.com wrote: Congrats everyone!!! On Tue, Feb 3, 2015 at 3:17 PM, Timothy Chen tnac...@gmail.com wrote: Congrats all! Tim On Feb 4, 2015, at 7:10 AM, Pritish Nawlakhe prit...@nirvana-international.com

Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of spark jobs which run in succession over various cached datasets, do small groups and transforms, and then call saveAsSequenceFile() on them. Each call to save as a sequence file appears to have done its work; the task says it completed in xxx.x seconds, but then it pauses

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-01-29 Thread Corey Nolet
had one IT that failed on me from the source build which we can fix later -- things are looking good otherwise from my testing. Thanks for working through this Corey, and Keith for finding bugs :) Corey Nolet wrote: Devs, Please consider the following candidate for Apache Accumulo

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering() - is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet cjno...@gmail.com wrote: In all of the solutions I've found thus far, sorting has been

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-01-28 Thread Corey Nolet
at 2:38 AM, Corey Nolet cjno...@apache.org wrote: Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc3 SHA1

Re: [VOTE] Apache Accumulo 1.6.2 RC3

2015-01-28 Thread Corey Nolet
I'll start on an RC4 but leave this open for a while in case any more issues like this pop up. On Jan 28, 2015 5:24 PM, Keith Turner ke...@deenlo.com wrote: -1 because of ACCUMULO-3541 On Wed, Jan 28, 2015 at 2:38 AM, Corey Nolet cjno...@apache.org wrote: Devs, Please consider

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet cjno...@gmail.com wrote: I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed
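
A sketch against the OrderedRDDFunctions API linked above (available in Spark 1.2): repartitionAndSortWithinPartitions shuffles once and sorts each partition by key in the same pass, which its scaladoc notes is more efficient than repartitioning and then sorting; the partitioner choice is illustrative:

    import org.apache.spark.HashPartitioner
    val sorted = pairRdd.repartitionAndSortWithinPartitions(new HashPartitioner(64))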

Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-01-27 Thread Corey Nolet
I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling this in Yarn. I see github classes for it in 1.2.0 and a property spark.shuffle.service.enabled in the spark-defaults.conf. The code mentions
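
A sketch of the wiring as documented for Spark 1.2's dynamic allocation, assuming the spark-*-yarn-shuffle jar is already on the NodeManager classpath; the yarn-site.xml portion registers the auxiliary service, and the Spark side sets spark.shuffle.service.enabled=true:

    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>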

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
, Corey Nolet cjno...@gmail.com wrote: I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running

[VOTE] Apache Accumulo 1.6.2 RC3

2015-01-27 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc3 SHA1: 3a6987470c1e5090a2ca159614a80f0fa50393bf Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1021/ Source tarball:

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
--- Basic build with unit tests. Thanks, Corey Nolet

Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
/util/HadoopCompatUtil.java PRE-CREATION examples/simple/src/main/java/org/apache/accumulo/examples/simple/mapreduce/TeraSortIngest.java 1b8cbaf Diff: https://reviews.apache.org/r/30280/diff/ Testing --- Basic build with unit tests. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
. Thanks, Corey Nolet

Re: Review Request 30280: ACCUMULO-3533 Making AbstractInputFormat.getConfiguration() protected to match backwards compatibility with 1.6.1

2015-01-26 Thread Corey Nolet
--- Basic build with unit tests. Thanks, Corey Nolet

Re: Review Request 30252: ACCUMULO-3531 update japi-compliance-check configs.

2015-01-26 Thread Corey Nolet
I believe Josh just committed a fix for the missing license header. On Mon, Jan 26, 2015 at 1:24 PM, Mike Drob md...@mdrob.com wrote: --- This is an automatically generated e-mail. To reply, visit:

Re: Review Request 30252: ACCUMULO-3531 update japi-compliance-check configs.

2015-01-25 Thread Corey Nolet
/30252/#comment114283 Good. I'll add this to the release documentation I've been working on. - Corey Nolet On Jan. 25, 2015, 9:38 a.m., Sean Busbey wrote: --- This is an automatically generated e-mail. To reply, visit: https

Re: [VOTE] Apache Accumulo 1.6.2 RC2

2015-01-25 Thread Corey Nolet
(under a semver patch increment, this should be just as strong an assertion as the reverse) http://people.apache.org/~busbey/compat_reports/accumulo/1.6.2_to_1.6.1/compat_report.html On Fri, Jan 23, 2015 at 8:02 PM, Corey Nolet cjno...@apache.org wrote: Devs, Please consider the following

Re: [VOTE] Apache Accumulo 1.6.2 RC2

2015-01-25 Thread Corey Nolet
Forwarding discussions to dev. On Jan 25, 2015 3:22 PM, Josh Elser josh.el...@gmail.com wrote: plus, I don't think it's valid to call this vote on the user list :) Corey Nolet wrote: -1 for backwards compatibility issues described. -1 Corey, I'm really sorry for the churn. I thought I

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-23 Thread Corey Nolet
Elser josh.el...@gmail.com wrote: I think we used to have instructions lying around that described how to use https://github.com/lvc/japi-compliance-checker (not like that has any influence on what Sean used, though :D) Corey Nolet wrote: Sean- is this what you were using [1]? [1

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-22 Thread Corey Nolet
, 2015 at 7:50 PM, Corey Nolet cjno...@gmail.com wrote: I did notice something strange reviewing this RC. It appears the staging repo doesn't have hash files for the detached GPG signatures (*.asc.md5, *.asc.sha1). That's new. Did you do something special regarding this, Corey

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-21 Thread Corey Nolet
I did notice something strange reviewing this RC. It appears the staging repo doesn't have hash files for the detached GPG signatures (*.asc.md5, *.asc.sha1). That's new. Did you do something special regarding this, Corey? Or maybe this is just a change with mvn, or maybe it's a change with

Re: [VOTE] Apache Accumulo 1.6.2 RC1

2015-01-21 Thread Corey Nolet
20, 2015 at 11:18 PM, Corey Nolet cjno...@apache.org wrote: Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc1 SHA1: 533d93adb17e8b27c5243c97209796f66c6b8b2d Staging Repository: https://repository.apache.org

[VOTE] Apache Accumulo 1.6.2 RC1

2015-01-20 Thread Corey Nolet
Devs, Please consider the following candidate for Apache Accumulo 1.6.2 Branch: 1.6.2-rc1 SHA1: 533d93adb17e8b27c5243c97209796f66c6b8b2d Staging Repository: https://repository.apache.org/content/repositories/orgapacheaccumulo-1018/ Source tarball:

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-19 Thread Corey Nolet
., Corey Nolet wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29959/ --- (Updated Jan. 16, 2015, 5:06 a.m.) Review request

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-19 Thread Corey Nolet
and a replaced volume appears in instance.volumes. Also verified that the error does not appear when 'bin/accumulo init --add-volumes' is called and the replaced volume does not appear in instance.volumes Thanks, Corey Nolet

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
, Jan 17, 2015 at 4:29 PM, Michael Armbrust mich...@databricks.com wrote: How are you running your test here? Are you perhaps doing a .count()? On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet cjno...@gmail.com wrote: Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being pushed down to the DataRelation are not the product of the SELECT clause but rather just the columns explicitly included in the WHERE clause. Examples from my testing: SELECT * FROM myTable -- The required columns are
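
A minimal sketch of the relation shape under discussion, against the Spark 1.2-era sources API; the class name and fields are illustrative, not the code from the thread:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext, StructType}
    import org.apache.spark.sql.sources.{Filter, PrunedFilteredScan}

    case class EventRelation(sqlContext: SQLContext, schema: StructType)
        extends PrunedFilteredScan {
      // requiredColumns is what the observation above concerns: in 1.2.0 it
      // reflects the WHERE clause rather than the SELECT list
      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] =
        sys.error("translate filters into store-side seeks here")
    }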

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
an example [1] of what I'm trying to accomplish. [1] https://github.com/calrissian/accumulo-recipes/blob/273/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStore.scala#L49 On Fri, Jan 16, 2015 at 10:17 PM, Corey Nolet cjno...@gmail.com wrote: Hao, Thanks so much

Re: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Corey Nolet
There's also an example of running a SparkContext in a java servlet container from Calrissian: https://github.com/calrissian/spark-jetty-server On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote: The question is about the ways to create a Windows desktop-based and/or

Re: Spark SQL Custom Predicate Pushdown

2015-01-16 Thread Corey Nolet
Down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples also can be found in the unit test: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources *From:* Corey Nolet

Re: Spark SQL API changes and stabilization

2015-01-15 Thread Corey Nolet
Reynold, One thing I'd like worked into the public portion of the API is the json inferencing logic that creates a Set[(String, StructType)] out of Map[String,Any]. SPARK-5260 addresses this so that I can use Accumulators to infer my schema instead of forcing a map/reduce phase to occur on an RDD

Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-15 Thread Corey Nolet
in instance.volumes. Also verified that the error does not appear when 'bin/accumulo init --add-volumes' is called and the replaced volume does not appear in instance.volumes Thanks, Corey Nolet

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-15 Thread Corey Nolet
/Initialize.java https://reviews.apache.org/r/29959/#comment112605 Just noticed this. We should certainly have the conversation to standardize on this. I don't mind doing what everyone's been doing; I just need to know what that is. - Corey Nolet On Jan. 16, 2015, 4:37 a.m., Corey Nolet wrote

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-15 Thread Corey Nolet
., Corey Nolet wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29959/ --- (Updated Jan. 16, 2015, 4:37 a.m.) Review request

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-15 Thread Corey Nolet
and a replaced volume appears in instance.volumes. Also verified that the error does not appear when 'bin/accumulo init --add-volumes' is called and the replaced volume does not appear in instance.volumes Thanks, Corey Nolet

Re: Review Request 29959: ACCUMULO-2793 Adding non-HA to HA migration info to user manual and log error when improperly configuring instance.volumes.

2015-01-15 Thread Corey Nolet
and a replaced volume appears in instance.volumes. Also verified that the error does not appear when 'bin/accumulo init --add-volumes' is called and the replaced volume does not appear in instance.volumes Thanks, Corey Nolet

Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Corey Nolet
I have document storage services in Accumulo that I'd like to expose to Spark SQL. I am able to push down predicate logic to Accumulo to have it perform only the seeks necessary on each tablet server to grab the results being asked for. I'm interested in using Spark SQL to push those predicates

Custom JSON schema inference

2015-01-14 Thread Corey Nolet
I'm working with RDD[Map[String,Any]] objects all over my codebase. These objects were all originally parsed from JSON. The processing I do on RDDs consists of parsing JSON -> grouping/transforming the dataset into a feasible report -> outputting data to a file. I've been wanting to infer the schemas

Re: Accumulators

2015-01-14 Thread Corey Nolet
Just noticed an error in my wording. Should be: I'm assuming it's not immediately aggregating on the driver each time I call += on the Accumulator. On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet cjno...@gmail.com wrote: What are the limitations of using Accumulators to get a union of a bunch

Accumulators

2015-01-14 Thread Corey Nolet
What are the limitations of using Accumulators to get a union of a bunch of small sets? Let's say I have an RDD[Map[String,Any]] and I want to do: rdd.map(accumulator += Set(_.get(entityType).get)) What implication does this have on performance? I'm assuming it's not immediately aggregating
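
A sketch of the Accumulable route for this, assuming the Spark 1.x API; note the increment belongs in an action such as foreach, since a map alone is lazy (per-task copies are merged on the driver as tasks finish):

    import org.apache.spark.AccumulableParam

    implicit object StringSetParam extends AccumulableParam[Set[String], String] {
      def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
      def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2 // driver-side merge
      def zero(initial: Set[String]): Set[String] = Set.empty[String]
    }

    val entityTypes = sc.accumulable(Set.empty[String])
    rdd.foreach(m => entityTypes += m("entityType").toString)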

Re: Web Service + Spark

2015-01-09 Thread Corey Nolet
Cui Lin, The solution largely depends on how you want your services deployed (Java web container, Spray framework, etc...) and if you are using a cluster manager like Yarn or Mesos vs. just firing up your own executors and master. I recently worked on an example for deploying Spark services

Submitting SparkContext and seeing driverPropsFetcher exception

2015-01-09 Thread Corey Nolet
I'm seeing this exception when creating a new SparkContext in YARN: [ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] - [akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [

Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2015-01-08 Thread Corey Nolet
-CREATION Diff: https://reviews.apache.org/r/29502/diff/ Testing --- Wrote an integration test to verify that ScanDataSource is actually setting the authorizations on the IteratorEnvironment Thanks, Corey Nolet

Re: Review Request 29502: ACCUMULO-3458 Adding scan authorizations to IteratorEnvironment

2015-01-07 Thread Corey Nolet
? Christopher Tubbs wrote: Probably best to just format and organize imports for all the changed files. I noticed a lot of other formatting issues, too. Corey Nolet wrote: Not sure why IntelliJ defaults to this behavior but it's fixed. Christopher Tubbs wrote: Import order