Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
at 2:19 PM, Corey Nolet cjno...@gmail.com wrote: Is it possible to select if, say, there was an addresses field that had a json array? You can get the Nth item by address.getItem(0). If you want to walk through the whole array look at LATERAL VIEW EXPLODE in HiveQL
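
A minimal sketch of both suggestions (array indexing and LATERAL VIEW explode in HiveQL), assuming a hypothetical people table with an addresses array; jsonStrings is a placeholder RDD[String] of the JSON blobs:

    import org.apache.spark.sql.hive.HiveContext

    val hctx = new HiveContext(sc)
    val people = hctx.jsonRDD(jsonStrings)          // jsonStrings: RDD[String], placeholder
    people.registerTempTable("people")

    // Grab a single element of the array directly
    hctx.sql("SELECT name, addresses[0].number FROM people")

    // Walk the whole array with LATERAL VIEW explode
    hctx.sql("SELECT name, addr.number FROM people LATERAL VIEW explode(addresses) a AS addr")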

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
).collect() res0: Array[org.apache.spark.sql.Row] = Array([John]) This will double show people who have more than one matching address. On Tue, Oct 28, 2014 at 5:52 PM, Corey Nolet cjno...@gmail.com wrote: So it wouldn't be possible to have a json string like this: { name:John, age:53, locations

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
Am I able to do a join on an exploded field? Like if I have another object: { streetNumber:2300, locationName:The Big Building} and I want to join with the previous json by the locations[].number field- is that possible? On Tue, Oct 28, 2014 at 9:31 PM, Corey Nolet cjno...@gmail.com wrote

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-10-28 Thread Corey Nolet
$QueryExecution.sparkPlan(SQLContext.scala:400) On Tue, Oct 28, 2014 at 10:48 PM, Michael Armbrust mich...@databricks.com wrote: Can you println the .queryExecution of the SchemaRDD? On Tue, Oct 28, 2014 at 7:43 PM, Corey Nolet cjno...@gmail.com wrote: So this appears to work just fine: hctx.sql(SELECT

Re: unsubscribe

2014-10-31 Thread Corey Nolet
Hongbin, Please send an email to user-unsubscr...@spark.apache.org in order to unsubscribe from the user list. On Fri, Oct 31, 2014 at 9:05 AM, Hongbin Liu hongbin@theice.com wrote: Apology for having to send to all. I am highly interested in spark, would like to stay in this mailing

Re: Spark SQL takes unexpected time

2014-11-04 Thread Corey Nolet
Michael, I should probably look closer myself @ the design of 1.2 vs 1.1 but I've been curious why Spark's in-memory data uses the heap instead of putting it off heap? Was this the optimization that was done in 1.2 to alleviate GC? On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari

Why mapred for the HadoopRDD?

2014-11-04 Thread Corey Nolet
I'm fairly new to spark and I'm trying to kick the tires with a few InputFormats. I noticed the sc.hadoopRDD() method takes a mapred JobConf instead of a MapReduce Job object. Is there future planned support for the mapreduce packaging?
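
For reference, a sketch of the two entry points side by side; the SequenceFile input formats are just illustrative, and input paths still need to be set on the configuration:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{JobConf, SequenceFileInputFormat => OldSeqFileInputFormat}
    import org.apache.hadoop.mapreduce.lib.input.{SequenceFileInputFormat => NewSeqFileInputFormat}

    // Old "mapred" API: takes a JobConf
    val jobConf = new JobConf(sc.hadoopConfiguration)
    val oldApiRdd = sc.hadoopRDD(jobConf,
      classOf[OldSeqFileInputFormat[LongWritable, Text]], classOf[LongWritable], classOf[Text])

    // New "mapreduce" API: takes a plain Configuration
    val newApiRdd = sc.newAPIHadoopRDD(sc.hadoopConfiguration,
      classOf[NewSeqFileInputFormat[LongWritable, Text]], classOf[LongWritable], classOf[Text])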

Configuring custom input format

2014-11-05 Thread Corey Nolet
I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. Creating the new RDD works fine but setting up the configuration file via the static methods on input formats that require a Hadoop Job object is proving to be difficult. Trying to new up my own Job object with the
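
A sketch of one way to wire this up; MyInputFormat, MyKey and MyValue are placeholders for the custom format's classes, and the comment reflects the workaround noted later in this thread (assign the Job to a variable and avoid the shell's toString on it):

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

    // Assign the Job to a val; the shell printing (toString-ing) an unsubmitted Job
    // appears to be what causes the failure discussed later in this thread.
    val job = Job.getInstance(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(job, new Path("hdfs:///data/input"))
    // ...call whatever static configuration methods the custom input format exposes on `job`...

    // MyInputFormat / MyKey / MyValue are placeholders
    val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
      classOf[MyInputFormat], classOf[MyKey], classOf[MyValue])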

Re: Configuring custom input format

2014-11-05 Thread Corey Nolet
, Corey Nolet cjno...@gmail.com wrote: I'm trying to use a custom input format with SparkContext.newAPIHadoopRDD. Creating the new RDD works fine but setting up the configuration file via the static methods on input formats that require a Hadoop Job object is proving to be difficult. Trying

Re: Selecting Based on Nested Values using Language Integrated Query Syntax

2014-11-06 Thread Corey Nolet
place that there is a problem is 'ln.streetnumber, which prevents the rest of the query from resolving. If you look at the subquery ln, it is only producing two columns: locationName and locationNumber. So streetnumber is not valid. On Tue, Oct 28, 2014 at 8:02 PM, Corey Nolet cjno...@gmail.com

Spark SQL Lazy Schema Evaluation

2014-11-12 Thread Corey Nolet
I'm loading sequence files containing json blobs in the value, transforming them into RDD[String] and then using hiveContext.jsonRDD(). It looks like Spark reads the files twice- once when I define the jsonRDD() and then again when I actually make my call to hiveContext.sql(). Looking @ the

Re: unsubscribe

2014-11-18 Thread Corey Nolet
Abdul, Please send an email to user-unsubscr...@spark.apache.org On Tue, Nov 18, 2014 at 2:05 PM, Abdul Hakeem alhak...@gmail.com wrote: - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

Re: Configuring custom input format

2014-11-25 Thread Corey Nolet
assigning the object to a temporary variable. Matei On Nov 5, 2014, at 2:54 PM, Corey Nolet cjno...@gmail.com wrote: The closer I look @ the stack trace in the Scala shell, it appears to be the call to toString() that is causing the construction of the Job object to fail. Is there a way

Running two different Spark jobs vs multi-threading RDDs

2014-12-05 Thread Corey Nolet
I've read in the documentation that RDDs can be run concurrently when submitted in separate threads. I'm curious how the scheduler would handle propagating these down to the tasks. I have 3 RDDs: - one RDD which loads some initial data, transforms it and caches it - two RDDs which use the cached

Re: Running two different Spark jobs vs multi-threading RDDs

2014-12-06 Thread Corey Nolet
Reading the documentation a little more closely, I'm using the wrong terminology. I'm using stages to refer to what spark is calling a job. I guess application (more than one spark context) is what I'm asking about On Dec 5, 2014 5:19 PM, Corey Nolet cjno...@gmail.com wrote: I've read

Spark eating exceptions in multi-threaded local mode

2014-12-16 Thread Corey Nolet
I've been running a job in local mode using --master local[*] and I've noticed that, for some reason, exceptions appear to get eaten- as in, I don't see them. If I debug in my IDE, I'll see that an exception was thrown if I step through the code but if I just run the application, it appears

Re: When will spark 1.2 released?

2014-12-19 Thread Corey Nolet
The dates of the jars were still of Dec 10th. I figured that was because the jars were staged in Nexus on that date (before the vote). On Fri, Dec 19, 2014 at 12:16 PM, Ted Yu yuzhih...@gmail.com wrote: Looking at: http://search.maven.org/#browse%7C717101892 The dates of the jars were

Submit spark jobs inside web application

2014-12-29 Thread Corey Nolet
I want to have a SparkContext inside of a web application running in Jetty that I can use to submit jobs to a cluster of Spark executors. I am running YARN. Ultimately, I would love it if I could just use something like SparkSubmit.main() to allocate a bunch of resources in YARN when the webapp

How to tell if RDD no longer has any children

2014-12-29 Thread Corey Nolet
Let's say I have an RDD which gets cached and has two children which do something with it: val rdd1 = ...cache() rdd1.saveAsSequenceFile() rdd1.groupBy()..saveAsSequenceFile() If I were to submit both calls to saveAsSequenceFile() in threads to take advantage of concurrency (where

Cached RDD

2014-12-29 Thread Corey Nolet
If I have 2 RDDs which depend on the same RDD like the following: val rdd1 = ... val rdd2 = rdd1.groupBy()... val rdd3 = rdd1.groupBy()... If I don't cache rdd1, will its lineage be calculated twice (once for rdd2 and once for rdd3)?
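
As general Spark behavior (not quoted from the thread): an uncached parent is recomputed by every action that depends on it, so caching rdd1 avoids evaluating its lineage twice. A sketch:

    val rdd1 = sc.textFile("hdfs:///data/in").cache()   // cache so the input is only read once
    val rdd2 = rdd1.groupBy(_.length)
    val rdd3 = rdd1.groupBy(_.headOption)

    rdd2.saveAsTextFile("hdfs:///out/byLength")      // first action computes rdd1 and populates the cache
    rdd3.saveAsTextFile("hdfs:///out/byFirstChar")   // second action reuses the cached partitions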

Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan 2, 2015 at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to get a SparkContext

Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
I'm trying to get a SparkContext going in a web container which is being submitted through yarn-client. I'm trying two different approaches and both seem to be resulting in the same error from the yarn nodemanagers: 1) I'm newing up a SparkContext directly, manually adding all the lib jars from

Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
2, 2015 at 5:46 PM, Corey Nolet cjno...@gmail.com wrote: .. and looking even further, it looks like the actual command that's executed starting up the JVM to run the org.apache.spark.deploy.yarn.ExecutorLauncher is passing in --class 'notused' --jar null. I would assume this isn't expected

Re: Submitting spark jobs through yarn-client

2015-01-02 Thread Corey Nolet
they aren't making it through. On Fri, Jan 2, 2015 at 5:02 PM, Corey Nolet cjno...@gmail.com wrote: Looking a little closer @ the launch_container.sh file, it appears to be adding a $PWD/__app__.jar to the classpath but there is no __app__.jar in the directory pointed to by PWD. Any ideas? On Fri, Jan

Re: Submitting spark jobs through yarn-client

2015-01-03 Thread Corey Nolet
driver application. Here's the example code on github: https://github.com/cjnolet/spark-jetty-server On Fri, Jan 2, 2015 at 11:35 PM, Corey Nolet cjno...@gmail.com wrote: So looking @ the actual code- I see where it looks like --class 'notused' --jar null is set on the ClientBase.scala when yarn

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now with Guava versions between my Spark application in Yarn and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0) while it appears the Spark executors that are working on my

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
My mistake Marcelo, I was looking at the wrong message. That reply was meant for bo yang. On Feb 4, 2015 4:02 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Corey, On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote: Another suggestion is to build Spark by yourself

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
works for YARN). Also thread at http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html . HTH, Markus On 02/03/2015 11:20 PM, Corey Nolet wrote: I'm having a really bad dependency conflict right now with Guava versions between my Spark

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke charles.fed...@gmail.com wrote: If you want to design something like Spark shell have a look at:

Custom JSON schema inference

2015-01-14 Thread Corey Nolet
I'm working with RDD[Map[String,Any]] objects all over my codebase. These objects were all originally parsed from JSON. The processing I do on RDDs consists of parsing json - grouping/transforming dataset into a feasible report - outputting data to a file. I've been wanting to infer the schemas

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
, Jan 17, 2015 at 4:29 PM, Michael Armbrust mich...@databricks.com wrote: How are you running your test here? Are you perhaps doing a .count()? On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet cjno...@gmail.com wrote: Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being pushed down to the DataRelation are not the product of the SELECT clause but rather just the columns explicitly included in the WHERE clause. Examples from my testing: SELECT * FROM myTable -- The required columns are

Spark SQL Custom Predicate Pushdown

2015-01-15 Thread Corey Nolet
I have document storage services in Accumulo that I'd like to expose to Spark SQL. I am able to push down predicate logic to Accumulo to have it perform only the seeks necessary on each tablet server to grab the results being asked for. I'm interested in using Spark SQL to push those predicates
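
Not the actual EventStore implementation referenced later in the thread, just a bare skeleton of the 1.2-era sources API involved (the class/trait shapes shifted a bit in later versions):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql._
    import org.apache.spark.sql.sources._

    // Spark hands buildScan the SELECTed columns and the WHERE-clause filters so they
    // can be pushed down to the underlying store (Accumulo here).
    class EventStoreRelation(sqlc: SQLContext, tableSchema: StructType)
        extends PrunedFilteredScan {

      override def sqlContext: SQLContext = sqlc
      override def schema: StructType = tableSchema

      override def buildScan(requiredColumns: Array[String],
                             filters: Array[Filter]): RDD[Row] = {
        // translate filters (EqualTo, GreaterThan, ...) into server-side ranges/iterators
        ???
      }
    }

    class DefaultSource extends RelationProvider {
      override def createRelation(sqlContext: SQLContext,
                                  parameters: Map[String, String]): BaseRelation =
        new EventStoreRelation(sqlContext, /* derive schema from parameters */ ???)
    }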

Re: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Corey Nolet
There's also an example of running a SparkContext in a java servlet container from Calrissian: https://github.com/calrissian/spark-jetty-server On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote: The question is about the ways to create a Windows desktop-based and/or

Re: Spark SQL Custom Predicate Pushdown

2015-01-16 Thread Corey Nolet
Down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples also can be found in the unit test: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources *From:* Corey Nolet

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
an example [1] of what I'm trying to accomplish. [1] https://github.com/calrissian/accumulo-recipes/blob/273/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStore.scala#L49 On Fri, Jan 16, 2015 at 10:17 PM, Corey Nolet cjno...@gmail.com wrote: Hao, Thanks so much

Re: Accumulators

2015-01-14 Thread Corey Nolet
Just noticed an error in my wording. Should be I'm assuming it's not immediately aggregating on the driver each time I call the += on the Accumulator. On Wed, Jan 14, 2015 at 9:19 PM, Corey Nolet cjno...@gmail.com wrote: What are the limitations of using Accumulators to get a union of a bunch

Accumulators

2015-01-14 Thread Corey Nolet
What are the limitations of using Accumulators to get a union of a bunch of small sets? Let's say I have an RDD[Map[String,Any]] and I want to do: rdd.map(accumulator += Set(_.get(entityType).get)) What implication does this have on performance? I'm assuming it's not immediately aggregating
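
A sketch of a set-union accumulable in the 1.x API, assuming rdd is the RDD[Map[String,Any]] above and "entityType" is a placeholder key. Updates are applied locally inside each task; the partial sets are merged into the driver's value as tasks complete:

    import org.apache.spark.AccumulableParam

    // Accumulates a union of small sets: += adds an element locally inside a task,
    // and the per-task sets are merged on the driver when tasks finish.
    object SetUnionParam extends AccumulableParam[Set[String], String] {
      def addAccumulator(acc: Set[String], elem: String): Set[String] = acc + elem
      def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
      def zero(initial: Set[String]): Set[String] = Set.empty[String]
    }

    val entityTypes = sc.accumulable(Set.empty[String])(SetUnionParam)
    rdd.foreach { m => entityTypes += m("entityType").toString }   // rdd: RDD[Map[String, Any]]
    println(entityTypes.value)   // read on the driver, after the action completes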

Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word partition here is a tad different than the term partition that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, If I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize); for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Corey Nolet
We've been using commons configuration to pull our properties out of properties files and system properties (prioritizing system properties over others) and we add those properties to our spark conf explicitly and we use ArgoPartser to get the command line argument for which property file to load.

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the spark.kryo.registrator property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class
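
A minimal sketch of that approach; Event and EventSerializer are hypothetical stand-ins for the custom class and its serializer:

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class Event(id: String, ts: Long)   // hypothetical domain class

    class EventSerializer extends Serializer[Event] {
      override def write(kryo: Kryo, out: Output, e: Event): Unit = {
        out.writeString(e.id); out.writeLong(e.ts)
      }
      override def read(kryo: Kryo, in: Input, clazz: Class[Event]): Event =
        Event(in.readString(), in.readLong())
    }

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[Event], new EventSerializer)   // class + its custom serializer
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")   // fully-qualified name (placeholder package)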

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
the data to a single partition (no matter what window I set) and it seems to lock up my jobs. I waited for 15 minutes for a stage that usually takes about 15 seconds and I finally just killed the job in yarn. On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet cjno...@gmail.com wrote: So I tried

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current

Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-01-27 Thread Corey Nolet
I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling this in Yarn. I see github classes for it in 1.2.0 and a property spark.shuffle.service.enabled in the spark-defaults.conf. The code mentions
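
For reference, the pieces that typically need to be configured for the YARN external shuffle service, as I understand them for 1.2 (verify against the docs for your version):

    # spark-defaults.conf
    spark.shuffle.service.enabled   true

    # yarn-site.xml on each NodeManager
    #   yarn.nodemanager.aux-services                        mapreduce_shuffle,spark_shuffle
    #   yarn.nodemanager.aux-services.spark_shuffle.class    org.apache.spark.network.yarn.YarnShuffleService
    # plus the spark-<version>-yarn-shuffle jar on the NodeManager classpath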

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet cjno...@gmail.com wrote: In all of the solutions I've found thus far, sorting has been

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
, Corey Nolet cjno...@gmail.com wrote: I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the
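
One common way to get MapReduce-style multiple outputs from Spark (a sketch, not necessarily what the thread settled on): partition by the group key and let a MultipleTextOutputFormat route each key into its own folder. The "group" field name is a placeholder:

    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.Partitioner
    import org.apache.spark.SparkContext._   // pair RDD functions in 1.2

    // Route each record into a folder named after its group key
    class GroupOutputFormat extends MultipleTextOutputFormat[String, String] {
      override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
        s"$key/$name"
    }

    class GroupPartitioner(groups: Array[String]) extends Partitioner {
      override def numPartitions: Int = groups.length
      override def getPartition(key: Any): Int = math.abs(groups.indexOf(key.toString)) % numPartitions
    }

    rdd.map(m => (m("group").toString, m.toString))        // rdd: RDD[Map[String, Any]]
       .partitionBy(new GroupPartitioner(groups))
       .saveAsHadoopFile("hdfs:///out", classOf[String], classOf[String], classOf[GroupOutputFormat])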

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet cjno...@gmail.com wrote: I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed

Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of spark jobs which run in succession over various cached datasets, do small groups and transforms, and then call saveAsSequenceFile() on them. Each call to save as a sequence file appears to have done its work, the task says it completed in xxx.x seconds but then it pauses

Re: Web Service + Spark

2015-01-09 Thread Corey Nolet
Cui Lin, The solution largely depends on how you want your services deployed (Java web container, Spray framework, etc...) and if you are using a cluster manager like Yarn or Mesos vs. just firing up your own executors and master. I recently worked on an example for deploying Spark services

Submitting SparkContext and seeing driverPropsFetcher exception

2015-01-09 Thread Corey Nolet
I'm seeing this exception when creating a new SparkContext in YARN: [ERROR] AssociationError [akka.tcp://sparkdri...@coreys-mbp.home:58243] - [akka.tcp://driverpropsfetc...@coreys-mbp.home:58453]: Error [Shut down address: akka.tcp://driverpropsfetc...@coreys-mbp.home:58453] [

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I have a SELECT * from myTable WHERE locations_$homeAddress = '123 Elm St' It's telling me $ is invalid. Is this a bug?

Re: SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
This doesn't seem to have helped. On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust mich...@databricks.com wrote: Try using `backticks` to escape non-standard characters. On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote: I don't remember Oracle ever enforcing that I

Re: Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Nevermind- I think I may have had a schema-related issue (sometimes booleans were represented as strings and sometimes as raw booleans, but when I populated the schema one or the other was chosen). On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet cjno...@gmail.com wrote: Here are the results

Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Here are the results of a few different SQL strings (let's assume the schemas are valid for the data types used): SELECT * from myTable where key1 = true - no filters are pushed to my PrunedFilteredScan SELECT * from myTable where key1 = true and key2 = 5 - 1 filter (key2) is pushed to my

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet cjno...@gmail.com wrote: I am using an input format to load data from Accumulo [1] in to a Spark

Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together based on some predefined business cases. After updating to 1.2.0, some of the concurrency expectations about how the stages within jobs are executed

Re: What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
: Looks like the number of skipped stages couldn't be formatted. Cheers On Wed, Jan 7, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: We just upgraded to Spark 1.2.0 and we're seeing this in the UI.

What does (### skipped) mean in the Spark UI?

2015-01-07 Thread Corey Nolet
We just upgraded to Spark 1.2.0 and we're seeing this in the UI.

Re: Strange DAG scheduling behavior on currently dependent RDDs

2015-01-07 Thread Corey Nolet
lineages. What's strange is that this bug only surfaced when I updated Spark. On Wed, Jan 7, 2015 at 9:12 AM, Corey Nolet cjno...@gmail.com wrote: We just updated to Spark 1.2.0 from Spark 1.1.0. We have a small framework that we've been developing that connects various different RDDs together

StreamingListener

2015-03-11 Thread Corey Nolet
Given the following scenario: dstream.map(...).filter(...).window(...).foreachRDD() When would the onBatchCompleted fire?
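
For experimenting with when the callback fires, a listener can be registered roughly like this; my understanding is that onBatchCompleted fires once all output operations of the batch, including the foreachRDD, have finished:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        // fires after the batch's output operations have completed
        println(s"batch ${batch.batchInfo.batchTime} done, " +
                s"processing delay = ${batch.batchInfo.processingDelay}")
      }
    })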

Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API. On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno
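
For what it's worth, a plain windowed mean (no ARIMA fitting) can be expressed over the micro-batches roughly as follows; metrics is a hypothetical DStream[Double]:

    import org.apache.spark.streaming.{Minutes, Seconds}

    val movingAvg = metrics
      .map(v => (v, 1L))                                                // (sum, count) pairs
      .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2),             // combine within the window
                      Minutes(5), Seconds(30))                          // 5 min window, sliding every 30s
      .map { case (sum, count) => sum / count }

    movingAvg.print()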

Re: Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
be listening to a partition. Yes, my understanding is that multiple receivers in one group are the way to consume a topic's partitions in parallel. On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet cjno...@gmail.com wrote: Looking @ [1], it seems to recommend pulling from multiple Kafka topics in order

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
Thanks for taking this on Ted! On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote: I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhang zzh...@hortonworks.com wrote: Currently in spark, it looks like there is no easy way to know the dependencies. It is solved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: Ted. That one I know. It was the dependency part I

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
+1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple of days and digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at

Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
if there was an automatic partition reconfiguration function that automagically did that... On Tue, Feb 24, 2015 at 3:20 AM, Corey Nolet cjno...@gmail.com wrote: I *think* this may have been related to the default memory overhead setting being too low. I raised the value to 1G it and tried my job again

Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
Looking @ [1], it seems to recommend pulling from multiple Kafka topics in order to parallelize data received from Kafka over multiple nodes. I notice in [2], however, that one of the createConsumer() functions takes a groupId. So am I understanding correctly that creating multiple DStreams with the
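
A sketch of the multiple-receiver pattern under discussion (one receiver per partition to consume, all unioned back into a single DStream); the ZooKeeper quorum, group and topic names are placeholders:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 4   // roughly one per Kafka partition to consume in parallel
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("my-topic" -> 1))
    }
    val unified = ssc.union(streams)   // downstream processing sees a single DStream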

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
? spark.shuffle.service.enable = true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and i've got over 1.3TB

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
: Could you try to turn on the external shuffle service? spark.shuffle.service.enable = true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
- but i have a suspicion this may have been the cause of the executors being killed by the application master. On Feb 23, 2015 5:25 PM, Corey Nolet cjno...@gmail.com wrote: I've got the opposite problem with regards to partitioning. I've got over 6000 partitions for some of these RDDs which

How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1 = sparkContext.sequenceFile().cache() val rdd2 = rdd1.map() How would I tell programmatically without being the one who built rdd1 and rdd2 whether or not rdd2

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I see the rdd.dependencies() function, does that include ALL the dependencies of an RDD? Is it safe to assume I can say rdd2.dependencies.contains(rdd1)? On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file

Re: Missing shuffle files

2015-02-21 Thread Corey Nolet
I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
the execution if there is no shuffle dependencies in between RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
be the behavior and myself and all my coworkers expected. On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet cjno...@gmail.com wrote: I should probably mention that my example case is much over simplified- Let's say I've got a tree, a fairly complex one where I begin a series of jobs at the root which

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
in almost all cases. That much, I don't know how hard it is to implement. But I speculate that it's easier to deal with it at that level than as a function of the dependency graph. On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to do the scheduling myself now

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
future { rdd1.saveAsHadoopFile(...) } future { rdd2.saveAsHadoopFile(…) } In this way, rdd1 will be calculated once, and two saveAsHadoopFile will happen concurrently. Thanks. Zhan Zhang On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote: What confused me
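
A fuller version of that snippet with the imports it needs; the output paths are placeholders and saveAsTextFile stands in for whichever save is actually used:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    rdd1.cache()   // make sure the shared parent is only computed once

    val f1 = Future { rdd1.saveAsTextFile("hdfs:///out/a") }
    val f2 = Future { rdd2.saveAsTextFile("hdfs:///out/b") }

    // both jobs are submitted to the scheduler concurrently; block until they finish
    Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)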

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
: * Return information about what RDDs are cached, if they are in mem or on disk, how much space * they take, etc. */ @DeveloperApi def getRDDStorageInfo: Array[RDDInfo] = { Cheers On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet cjno...@gmail.com wrote: Zhan, This is exactly what I'm

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Corey Nolet
Spark uses a SerializableWritable [1] to java serialize writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular java serialization. Though it is wrapping the actual AccumuloInputFormat (another example of something you

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Corey Nolet
I would do sum of squares. This would allow you to keep an ongoing value as an associative operation (in an aggregator) and then calculate the variance / std deviation after the fact. On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, I have a DataFrame object and I
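
A sketch of that sum-of-squares approach on a plain RDD[Double] (values is a placeholder); it is one pass and fully associative, so it also works inside an aggregator:

    // values: RDD[Double], e.g. the column of interest mapped out of the DataFrame
    val (count, sum, sumSq) = values.aggregate((0L, 0.0, 0.0))(
      (acc, v) => (acc._1 + 1, acc._2 + v, acc._3 + v * v),    // fold a value into a partition accumulator
      (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))     // merge partition accumulators

    val mean     = sum / count
    val variance = sumSq / count - mean * mean   // population variance
    val stddev   = math.sqrt(variance)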

Streaming anomaly detection using ARIMA

2015-03-27 Thread Corey Nolet
I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform a light anomaly detection. The time series data is going to be bucketed to different time units (several minutes within several hours, several hours within several days, several days within several

SparkR newHadoopAPIRDD

2015-04-01 Thread Corey Nolet
How hard would it be to expose this in some way? I ask because the current textFile and objectFile functions are obviously at some point calling out to a FileInputFormat and configuring it. Could we get a way to configure any arbitrary inputformat / outputformat?

Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
for ARIMA models? On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote: Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Corey Nolet
If you return an iterable, you are not tying the API to a CompactBuffer. Someday, the data could be fetched lazily and the API would not have to change. On Apr 23, 2015 6:59 PM, Dean Wampler deanwamp...@gmail.com wrote: I wasn't involved in this decision (I just make the fries), but

Re: DAG

2015-04-25 Thread Corey Nolet
Giovanni, The DAG can be walked by calling the dependencies() function on any RDD. It returns a Seq containing the parent RDDs. If you start at the leaves and walk through the parents until dependencies() returns an empty Seq, you ultimately have your DAG. On Sat, Apr 25, 2015 at 1:28 PM, Akhil
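
A sketch of that walk; each Dependency exposes its parent RDD via .rdd (note this does not deduplicate shared ancestors in a diamond-shaped graph):

    import org.apache.spark.rdd.RDD

    // Recursively collect every ancestor of an RDD by following dependencies()
    def ancestors(rdd: RDD[_]): Seq[RDD[_]] =
      rdd.dependencies.map(_.rdd).flatMap(parent => parent +: ancestors(parent))

    // e.g. ancestors(finalRdd).foreach(r => println(r.id + " " + r.toString))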

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Corey Nolet
A tad off topic, but could still be relevant. Accumulo's design is a tad different in the realm of being able to shard and perform set intersections/unions server-side (through seeks). I've got an adapter for Spark SQL on top of a document store implementation in Accumulo that accepts the

Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD call. After the output format runs, I want to do some directory cleanup and I want to block while I'm doing that. Is that something I can do inside of this function? If not, where would I accomplish this on every

Re: Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
It does look like the function that's executed is in the driver, so doing an Await.result() on a thread AFTER I've executed an action should work. Just updating this here in case anyone has this question in the future. Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD
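
A sketch of that pattern: the foreachRDD body runs on the driver and the action blocks until the batch's save finishes, so the cleanup below naturally runs after the files are written (paths and the cleanup itself are placeholders):

    import org.apache.hadoop.fs.{FileSystem, Path}

    dstream.foreachRDD { rdd =>
      val outDir = s"hdfs:///out/batch-${System.currentTimeMillis}"
      rdd.saveAsTextFile(outDir)                       // blocks on the driver until the save job finishes

      // blocking cleanup is fine here; with the default of one concurrent streaming job,
      // subsequent output operations wait for this function to return
      val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
      fs.delete(new Path(outDir, "_SUCCESS"), false)   // placeholder cleanup, e.g. strip marker files
    }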

Re: Streaming anomaly detection using ARIMA

2015-04-10 Thread Corey Nolet
tried this? Within a window you would probably take the first x% as training and the rest as test. I don't think there's a question of looking across windows. On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote: Surprised I haven't gotten any responses about this. Has

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Reducer memory usage

2015-06-21 Thread Corey Nolet
I've seen a few places where it's been mentioned that after a shuffle each reducer needs to pull its partition into memory in its entirety. Is this true? I'd assume the merge sort that needs to be done (in the cases where sortByKey() is not used) wouldn't need to pull all of the data into memory

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
of the data in the partition (fetching more than 1 record @ a time). On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet cjno...@gmail.com wrote: I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time, that means each record is being pulled 1 at a time. That's the beauty of exposing Iterators / Iterables in an API rather than collections-

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341 On Thu, Jun 18, 2015 at 7:55 PM, Du Li l...@yahoo-inc.com.invalid wrote: repartition() means coalesce(shuffle=false) On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't

Coalescing with shuffle = false in imbalanced cluster

2015-06-18 Thread Corey Nolet
I'm confused about this. The comment on the function seems to indicate that there is absolutely no shuffle or network IO but it also states that it assigns an even number of parent partitions to each final partition group. I'm having trouble seeing how this can be guaranteed without some data

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-18 Thread Corey Nolet
at 5:51 PM, Corey Nolet cjno...@gmail.com wrote: An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, Is there any way
