Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Corey Nolet
> apache-spark-on-a-single-node-machine.html > regards, > 2018-05-23 22:30 GMT+02:00 Corey Nolet <cjno...@gmail.com>: >> Please forgive me if this question has been asked already. >> I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'

PySpark API on top of Apache Arrow

2018-05-23 Thread Corey Nolet
Please forgive me if this question has been asked already. I'm working in Python with Arrow+Plasma+Pandas Dataframes. I'm curious if anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1TB of

Re: Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
the other user gets worked into the model. On Mon, Nov 27, 2017 at 3:08 PM, Corey Nolet <cjno...@gmail.com> wrote: > I'm trying to use the MatrixFactorizationModel to, for instance, determine > the latent factors of a user or item that were not used in the training > data of

Using MatrixFactorizationModel as a feature extractor

2017-11-27 Thread Corey Nolet
I'm trying to use the MatrixFactorizationModel to, for instance, determine the latent factors of a user or item that were not used in the training data of the model. I'm not as concerned about the rating as I am with the latent factors for the user/item. Thanks!

Re: Apache Flink

2016-04-17 Thread Corey Nolet
don't care what each individual > event/tuple does, e.g. if you push different event types to separate kafka > topics and all you care about is a count, what is the need for single event > processing? > On Sun, Apr 17, 2016 at 12:43 PM, Corey Nolet <cjno...@gmail.c

Re: Apache Flink

2016-04-17 Thread Corey Nolet
> Dr Mich Talebzadeh > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com

Re: Apache Flink

2016-04-17 Thread Corey Nolet
One thing I've noticed about Flink in my following of the project has been that it has established, in a few cases, some novel ideas and improvements over Spark. The problem with it, however, is that both the development team and the community around it are very small and many of those novel

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
Nevermind, a look @ the ExternalSorter class tells me that the iterator for each key that's only partially ordered ends up being merge sorted by equality after the fact. Wanted to post my finding on here for others who may have the same questions. On Tue, Mar 1, 2016 at 3:05 PM, Corey Nolet

Re: Shuffle guarantees

2016-03-01 Thread Corey Nolet
. How can this be assumed if the object used for the key, for instance, in the case where a HashPartitioner is used, cannot assume ordering and therefore cannot assume a comparator can be used? On Tue, Mar 1, 2016 at 2:56 PM, Corey Nolet <cjno...@gmail.com> wrote: > So if I'

Shuffle guarantees

2016-03-01 Thread Corey Nolet
So if I'm using reduceByKey() with a HashPartitioner, I understand that the hashCode() of my key is used to create the underlying shuffle files. Is anything other than hashCode() used in the shuffle files when the data is pulled into the reducers and run through the reduce function? The reason
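
A minimal sketch of the setup being asked about (assuming the Scala RDD API; the data and partition count are illustrative):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("shuffle-guarantees").setMaster("local[*]"))

    // With a HashPartitioner, the target shuffle partition is derived from
    // key.hashCode(); the key only needs a consistent hashCode()/equals(),
    // no Ordering/comparator is required.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    val reduced = pairs.reduceByKey(new HashPartitioner(4), _ + _)
    reduced.collect().foreach(println)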

Re: Shuffle memory woes

2016-02-08 Thread Corey Nolet
spark dev people will say. > Corey do you have presentation available online? > > On 8 February 2016 at 05:16, Corey Nolet <cjno...@gmail.com> wrote: > >> Charles, >> >> Thank you for chiming in and I'm glad someone else is experiencing this >> too and n

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
of children and doesn't even run concurrently with any other stages so I ruled out the concurrency of the stages as a culprit for the shuffling problem we're seeing. On Sun, Feb 7, 2016 at 7:49 AM, Corey Nolet <cjno...@gmail.com> wrote: > Igor, > > I don't think the question is "wh

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
by key or something it should be > ok, so some detail is missing...skewed data? aggregate by key? > > On 6 February 2016 at 20:13, Corey Nolet <cjno...@gmail.com> wrote: > >> Igor, >> >> Thank you for the response but unfortunately, the problem I'm referring

Re: Shuffle memory woes

2016-02-07 Thread Corey Nolet
ey: >>"The dataset is 100gb at most, the spills can up to 10T-100T", Are >> your input files lzo format, and you use sc.text() ? If memory is not >> enough, spark will spill 3-4x of input data to disk. >> >> >> -- 原始邮件 ---

Re: Help needed in deleting a message posted in Spark User List

2016-02-06 Thread Corey Nolet
The whole purpose of Apache mailing lists is that the messages get indexed all over the web so that discussions and questions/solutions can be searched easily by google and other engines. For this reason, and the messages being sent via email as Steve pointed out, it's just not possible to

Re: Shuffle memory woes

2016-02-06 Thread Corey Nolet
rtitions > play with shuffle memory fraction > > in spark 1.6 cache vs shuffle memory fractions are adjusted automatically > > On 5 February 2016 at 23:07, Corey Nolet <cjno...@gmail.com> wrote: > >> I just recently had a discovery that my jobs were taking several hours to &

Shuffle memory woes

2016-02-05 Thread Corey Nolet
I just recently had a discovery that my jobs were taking several hours to complete because of excess shuffle spills. What I found was that when I hit the high point where I didn't have enough memory for the shuffles to store all of their file consolidations at once, it could spill so many times

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread Corey Nolet
David, Thank you very much for announcing this! It looks like it could be very useful. Would you mind providing a link to the github? On Tue, Jan 12, 2016 at 10:03 AM, David wrote: > Hi all, > > I'd like to share news of the recent release of a new Spark

Re: MongoDB and Spark

2015-09-11 Thread Corey Nolet
Unfortunately, MongoDB does not directly expose its locality via its client API so the problem with trying to schedule Spark tasks against it is that the tasks themselves cannot be scheduled locally on nodes containing query results- which means you can only assume most results will be sent over

Re: What is the reason for ExecutorLostFailure?

2015-08-18 Thread Corey Nolet
Usually more information as to the cause of this will be found down in your logs. I generally see this happen when an out of memory exception has occurred for one reason or another on an executor. It's possible your memory settings are too small per executor or the concurrent number of tasks you

Re: Newbie question: what makes Spark run faster than MapReduce

2015-08-07 Thread Corey Nolet
1) Spark only needs to shuffle when data needs to be partitioned around the workers in an all-to-all fashion. 2) Multi-stage jobs that would normally require several MapReduce jobs (causing data to be dumped to disk between the jobs) can instead keep their intermediate data cached in memory.
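
A small sketch of point 2 (assuming an existing SparkContext sc; the path and field indices are illustrative): the intermediate result is cached once and reused by later jobs instead of being re-read from disk.

    // Cache an intermediate RDD so several downstream jobs reuse the
    // in-memory copy rather than re-reading and re-parsing the input.
    val cleaned = sc.textFile("hdfs:///data/events")
      .map(_.split(","))
      .filter(_.length > 3)
      .cache()

    val countsByType = cleaned.map(f => (f(1), 1L)).reduceByKey(_ + _)
    val countsByUser = cleaned.map(f => (f(0), 1L)).reduceByKey(_ + _)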

SparkConf ignoring keys

2015-08-05 Thread Corey Nolet
I've been using SparkConf on my project for quite some time now to store configuration information for its various components. This has worked very well thus far in situations where I have control over the creation of the SparkContext and the SparkConf. I have run into a bit of a problem trying to

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-28 Thread Corey Nolet
related logs can be found in RM, NM, DN, NN log files in detail. Thanks again. On Mon, Jul 27, 2015 at 7:45 PM, Corey Nolet cjno...@gmail.com wrote: Elkhan, What does the ResourceManager say about the final status of the job? Spark jobs that run as Yarn applications can fail but still

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-27 Thread Corey Nolet
Elkhan, What does the ResourceManager say about the final status of the job? Spark jobs that run as Yarn applications can fail but still successfully clean up their resources and give them back to the Yarn cluster. Because of this, there's a difference between your code throwing an exception in

MapType vs StructType

2015-07-17 Thread Corey Nolet
I notice JSON objects are all parsed as Map[String,Any] in Jackson but for some reason, the inferSchema tools in Spark SQL extracts the schema of nested JSON objects as StructTypes. This makes it really confusing when trying to rectify the object hierarchy when I have maps because the Catalyst

Re: MapType vs StructType

2015-07-17 Thread Corey Nolet
doesn't have differentiated data structures so we go with the one that gives you more information when doing inference by default. If you pass in a schema to JSON however, you can override this and have a JSON object parsed as a map. On Fri, Jul 17, 2015 at 11:02 AM, Corey Nolet cjno

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
of the data in the partition (fetching more than 1 record @ a time). On Thu, Jun 25, 2015 at 12:19 PM, Corey Nolet cjno...@gmail.com wrote: I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time

Re: map vs mapPartitions

2015-06-25 Thread Corey Nolet
I don't know exactly what's going on under the hood but I would not assume that just because a whole partition is not being pulled into memory @ one time, each record must be pulled one at a time. That's the beauty of exposing Iterators and Iterables in an API rather than collections-

Reducer memory usage

2015-06-21 Thread Corey Nolet
I've seen a few places where it's been mentioned that after a shuffle each reducer needs to pull its partition into memory in its entirety. Is this true? I'd assume the merge sort that needs to be done (in the cases where sortByKey() is not used) wouldn't need to pull all of the data into memory

Re: Grouping elements in a RDD

2015-06-20 Thread Corey Nolet
If you use rdd.mapPartitions(), you'll be able to get a hold of the iterators for each partition. Then you should be able to do iterator.grouped(size) on each of the partitions. I think it may mean you end up with one group at the end of each partition that has fewer than size elements. If that's okay
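
A sketch of that suggestion (batchSize and the per-group processing step are placeholders):

    // Chunk each partition's iterator into fixed-size groups without
    // materializing the whole partition; the final group in a partition
    // may hold fewer than batchSize elements.
    val batchSize = 30
    val results = peopleRdd.mapPartitions { iter =>
      iter.grouped(batchSize).map(group => processGroup(group))  // processGroup is a placeholder
    }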

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L341 On Thu, Jun 18, 2015 at 7:55 PM, Du Li l...@yahoo-inc.com.invalid wrote: repartition() means coalesce(shuffle=false) On Thursday, June 18, 2015 4:07 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't

Coalescing with shuffle = false in imbalanced cluster

2015-06-18 Thread Corey Nolet
I'm confused about this. The comment on the function seems to indicate that there is absolutely no shuffle or network IO but it also states that it assigns an even number of parent partitions to each final partition group. I'm having trouble seeing how this can be guaranteed without some data

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-18 Thread Corey Nolet
at 5:51 PM, Corey Nolet cjno...@gmail.com wrote: An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, Is there any way

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Corey Nolet
Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, Du Li l...@yahoo-inc.com.invalid wrote: I got the same problem with rdd.repartition() in my streaming app, which generated a few huge partitions and many tiny partitions. The resulting high data skew makes the processing

Executor memory allocations

2015-06-17 Thread Corey Nolet
So I've seen in the documentation that (after the overhead memory is subtracted), the memory allocations of each executor are as follows (assume default settings): 60% for cache, 40% for tasks to process data. Reading about how Spark implements shuffling, I've also seen it say 20% of executor

Re: Is there programmatic way running Spark job on Yarn cluster without using spark-submit script ?

2015-06-17 Thread Corey Nolet
An example of being able to do this is provided in the Spark Jetty Server project [1] [1] https://github.com/calrissian/spark-jetty-server On Wed, Jun 17, 2015 at 8:29 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, Is there any way running Spark job in programmatic way on Yarn

Using spark.hadoop.* to set Hadoop properties

2015-06-17 Thread Corey Nolet
I've become accustomed to being able to use system properties to override properties in the Hadoop Configuration objects. I just recently noticed that when Spark creates the Hadoop Configuration in the SparkContext, it cycles through any properties prefixed with spark.hadoop. and adds those
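
A sketch of the prefixing behavior described above (the Hadoop property chosen is just an example):

    import org.apache.spark.{SparkConf, SparkContext}

    // Keys prefixed with "spark.hadoop." are copied (minus the prefix) into
    // the Hadoop Configuration that the SparkContext builds.
    val conf = new SparkConf()
      .setAppName("hadoop-props")
      .set("spark.hadoop.mapreduce.input.fileinputformat.split.minsize", "134217728")

    val sc = new SparkContext(conf)
    // sc.hadoopConfiguration.get("mapreduce.input.fileinputformat.split.minsize")
    // now returns "134217728"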

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet cjno...@gmail.com wrote: Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet cjno...@gmail.com wrote: So with this... to help my understanding of Spark under the hood- Is this statement correct When

Re: yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
...@cloudera.com wrote: Hi Corey, As of this PR https://github.com/apache/spark/pull/5297/files, this can be controlled with spark.yarn.submit.waitAppCompletion. -Sandy On Thu, May 28, 2015 at 11:48 AM, Corey Nolet cjno...@gmail.com wrote: I am submitting jobs to my yarn cluster via the yarn
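
For reference, a sketch of setting the flag named in that PR (hedged: it must reach the submitting client, e.g. through spark-defaults.conf or spark-submit --conf; exact behavior depends on the Spark version):

    # spark-defaults.conf
    # (equivalently: spark-submit --conf spark.yarn.submit.waitAppCompletion=false ...)
    spark.yarn.submit.waitAppCompletion  false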

yarn-cluster spark-submit process not dying

2015-05-28 Thread Corey Nolet
I am submitting jobs to my yarn cluster via the yarn-cluster mode and I'm noticing the jvm that fires up to allocate the resources, etc... is not going away after the application master and executors have been allocated. Instead, it just sits there printing 1 second status updates to the console.

Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD call. After the input format runs, I want to do some directory cleanup and I want to block while I'm doing that. Is that something I can do inside of this function? If not, where would I accomplish this on every

Re: Blocking DStream.forEachRDD()

2015-05-07 Thread Corey Nolet
It does look like the function that's executed is in the driver, so doing an Await.result() on a thread AFTER I've executed an action should work. Just updating this here in case anyone has this question in the future. Is this something I can do? I am using a FileOutputFormat inside of the foreachRDD
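
A sketch of the pattern described here (the output path, the cleanup step, and the timeout are placeholders):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    // The function passed to foreachRDD runs on the driver, so it is safe
    // to block there after the batch's output action has finished.
    dstream.foreachRDD { rdd =>
      rdd.saveAsTextFile(s"hdfs:///out/batch-${System.currentTimeMillis}")
      val cleanup = Future { cleanUpDirectories() }  // cleanUpDirectories is a placeholder
      Await.result(cleanup, 5.minutes)
    }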

Re: real time Query engine Spark-SQL on Hbase

2015-04-30 Thread Corey Nolet
A tad off topic, but could still be relevant. Accumulo's design is a tad different in the realm of being able to shard and perform set intersections/unions server-side (through seeks). I've got an adapter for Spark SQL on top of a document store implementation in Accumulo that accepts the

Re: DAG

2015-04-25 Thread Corey Nolet
Giovanni, The DAG can be walked by calling the dependencies() function on any RDD. It returns a Seq containing the parent RDDs. If you start at the leaves and walk through the parents until dependencies() returns an empty Seq, you ultimately have your DAG. On Sat, Apr 25, 2015 at 1:28 PM, Akhil
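
A sketch of walking the lineage that way (purely illustrative):

    import org.apache.spark.rdd.RDD

    // Recursively collect an RDD's ancestors by following dependencies()
    // until RDDs with no parents (the leaves of the DAG) are reached.
    def ancestors(rdd: RDD[_]): Seq[RDD[_]] = {
      val parents = rdd.dependencies.map(_.rdd)
      parents ++ parents.flatMap(ancestors)
    }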

Re: why does groupByKey return RDD[(K, Iterable[V])] not RDD[(K, CompactBuffer[V])] ?

2015-04-23 Thread Corey Nolet
If you return an iterable, you are not tying the API to a CompactBuffer. Someday, the data could be fetched lazily and the API would not have to change. On Apr 23, 2015 6:59 PM, Dean Wampler deanwamp...@gmail.com wrote: I wasn't involved in this decision (I just make the fries), but

Re: Streaming anomaly detection using ARIMA

2015-04-10 Thread Corey Nolet
tried this? Within a window you would probably take the first x% as training and the rest as test. I don't think there's a question of looking across windows. On Thu, Apr 2, 2015 at 12:31 AM, Corey Nolet cjno...@gmail.com wrote: Surprised I haven't gotten any responses about this. Has

SparkR newHadoopAPIRDD

2015-04-01 Thread Corey Nolet
How hard would it be to expose this in some way? I ask because the current textFile and objectFile functions are obviously at some point calling out to a FileInputFormat and configuring it. Could we get a way to configure any arbitrary inputformat / outputformat?

Re: Streaming anomaly detection using ARIMA

2015-04-01 Thread Corey Nolet
for ARIMA models? On Mon, Mar 30, 2015 at 9:30 AM, Corey Nolet cjno...@gmail.com wrote: Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched

Re: Streaming anomaly detection using ARIMA

2015-03-30 Thread Corey Nolet
Taking out the complexity of the ARIMA models to simplify things- I can't seem to find a good way to represent even standard moving averages in spark streaming. Perhaps it's my ignorance with the micro-batched style of the DStreams API. On Fri, Mar 27, 2015 at 9:13 PM, Corey Nolet cjno

Streaming anomaly detection using ARIMA

2015-03-27 Thread Corey Nolet
I want to use ARIMA for a predictive model so that I can take time series data (metrics) and perform a light anomaly detection. The time series data is going to be bucketed to different time units (several minutes within several hours, several hours within several days, several days within several

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-26 Thread Corey Nolet
Spark uses a SerializableWritable [1] to java serialize writable objects. I've noticed (at least in Spark 1.2.1) that it breaks down with some objects when Kryo is used instead of regular java serialization. Though it is wrapping the actual AccumuloInputFormat (another example of something you

Re: [SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Corey Nolet
I would do sum of squares. This would allow you to keep an ongoing value as an associative operation (in an aggregator) and then calculate the variance / std deviation after the fact. On Wed, Mar 25, 2015 at 10:28 PM, Haopu Wang hw...@qilinsoft.com wrote: Hi, I have a DataFrame object and I
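
A sketch of that approach (assuming values is an RDD[Double] holding the column; this derives the population variance):

    // Accumulate (count, sum, sum of squares) in one associative pass,
    // then derive variance and standard deviation from the three totals.
    val (n, sum, sumSq) = values.aggregate((0L, 0.0, 0.0))(
      (acc, x) => (acc._1 + 1, acc._2 + x, acc._3 + x * x),
      (a, b)   => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    val mean     = sum / n
    val variance = sumSq / n - mean * mean
    val stddev   = math.sqrt(variance)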

StreamingListener

2015-03-11 Thread Corey Nolet
Given the following scenario: dstream.map(...).filter(...).window(...).foreachrdd() When would the onBatchCompleted fire?

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
Thanks for taking this on Ted! On Sat, Feb 28, 2015 at 4:17 PM, Ted Yu yuzhih...@gmail.com wrote: I have created SPARK-6085 with pull request: https://github.com/apache/spark/pull/4836 Cheers On Sat, Feb 28, 2015 at 12:08 PM, Corey Nolet cjno...@gmail.com wrote: +1 to a better default

Re: bitten by spark.yarn.executor.memoryOverhead

2015-02-28 Thread Corey Nolet
+1 to a better default as well. We were working fine until we ran against a real dataset which was much larger than the test dataset we were using locally. It took me a couple days and digging through many logs to figure out this value was what was causing the problem. On Sat, Feb 28, 2015 at

Re: Missing shuffle files

2015-02-28 Thread Corey Nolet
if there was an automatic partition reconfiguration function that automagically did that... On Tue, Feb 24, 2015 at 3:20 AM, Corey Nolet cjno...@gmail.com wrote: I *think* this may have been related to the default memory overhead setting being too low. I raised the value to 1G and tried my job again

Re: Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
be listening to a partition. Yes, my understanding is that multiple receivers in one group are the way to consume a topic's partitions in parallel. On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet cjno...@gmail.com wrote: Looking @ [1], it seems to recommend pull from multiple Kafka topics in order

Kafka DStream Parallelism

2015-02-27 Thread Corey Nolet
Looking @ [1], it seems to recommend pull from multiple Kafka topics in order to parallelize data received from Kafka over multiple nodes. I notice in [2], however, that one of the createConsumer() functions takes a groupId. So am I understanding correctly that creating multiple DStreams with the
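
A sketch of the multi-receiver pattern under discussion (assuming the receiver-based KafkaUtils.createStream API from Spark 1.x and an existing StreamingContext ssc; hosts, group, and topic names are illustrative):

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Several receivers in the same consumer group split the topic's
    // partitions among themselves; union() merges them into one DStream.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("events" -> 1))
    }
    val unified = ssc.union(streams)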

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Zhang zzh...@hortonworks.com wrote: Currently in spark, it looks like there is no easy way to know the dependencies. It is solved at run time. Thanks. Zhan Zhang On Feb 26, 2015, at 4:20 PM, Corey Nolet cjno...@gmail.com wrote: Ted. That one I know. It was the dependency part I

How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1 = sparkContext.sequenceFile().cache() val rdd2 = rdd1.map() How would I tell programmatically without being the one who built rdd1 and rdd2 whether or not rdd2

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
I see the rdd.dependencies() function, does that include ALL the dependencies of an RDD? Is it safe to assume I can say rdd2.dependencies.contains(rdd1)? On Thu, Feb 26, 2015 at 4:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
the execution if there is no shuffle dependencies in between RDDs. Thanks. Zhan Zhang On Feb 26, 2015, at 1:28 PM, Corey Nolet cjno...@gmail.com wrote: Let's say I'm given 2 RDDs and told to store them in a sequence file and they have the following dependency: val rdd1

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
be the behavior that myself and all my coworkers expected. On Thu, Feb 26, 2015 at 6:26 PM, Corey Nolet cjno...@gmail.com wrote: I should probably mention that my example case is much over-simplified- Let's say I've got a tree, a fairly complex one where I begin a series of jobs at the root which

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
in almost all cases. That much, I don't know how hard it is to implement. But I speculate that it's easier to deal with it at that level than as a function of the dependency graph. On Thu, Feb 26, 2015 at 10:49 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to do the scheduling myself now

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
future { rdd1.saveAsHadoopFile(...) } future { rdd2.saveAsHadoopFile(…) } In this way, rdd1 will be calculated once, and the two saveAsHadoopFile calls will happen concurrently. Thanks. Zhan Zhang On Feb 26, 2015, at 3:28 PM, Corey Nolet cjno...@gmail.com wrote: What confused me
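
A sketch of the suggestion quoted here (cache the shared parent, then submit both saves concurrently; output paths and the simpler saveAsTextFile call are illustrative):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global

    rdd1.cache()  // make sure the shared lineage is computed only once

    val save1 = Future { rdd1.saveAsTextFile("hdfs:///out/rdd1") }
    val save2 = Future { rdd2.saveAsTextFile("hdfs:///out/rdd2") }

    // Both jobs are submitted to the scheduler concurrently; block until done.
    Await.result(Future.sequence(Seq(save1, save2)), Duration.Inf)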

Re: How to tell if one RDD depends on another

2015-02-26 Thread Corey Nolet
: * Return information about what RDDs are cached, if they are in mem or on disk, how much space * they take, etc. */ @DeveloperApi def getRDDStorageInfo: Array[RDDInfo] = { Cheers On Thu, Feb 26, 2015 at 4:00 PM, Corey Nolet cjno...@gmail.com wrote: Zhan, This is exactly what I'm

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
? spark.shuffle.service.enabled = true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
: Could you try to turn on the external shuffle service? spark.shuffle.service.enabled = true On 21.2.2015. 17:50, Corey Nolet wrote: I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm

Re: Missing shuffle files

2015-02-23 Thread Corey Nolet
- but i have a suspicion this may have been the cause of the executors being killed by the application master. On Feb 23, 2015 5:25 PM, Corey Nolet cjno...@gmail.com wrote: I've got the opposite problem with regards to partitioning. I've got over 6000 partitions for some of these RDDs which

Re: Missing shuffle files

2015-02-21 Thread Corey Nolet
I'm experiencing the same issue. Upon closer inspection I'm noticing that executors are being lost as well. Thing is, I can't figure out how they are dying. I'm using MEMORY_AND_DISK_SER and i've got over 1.3TB of memory allocated for the application. I was thinking perhaps it was possible that a

Re: Can't I mix non-Spark properties into a .properties file and pass it to spark-submit via --properties-file?

2015-02-16 Thread Corey Nolet
We've been using commons configuration to pull our properties out of properties files and system properties (prioritizing system properties over others) and we add those properties to our spark conf explicitly and we use ArgoPartser to get the command line argument for which property file to load.

SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
I don't remember Oracle ever enforcing that I couldn't include a $ in a column name, but I also don't think I've ever tried. When using sqlContext.sql(...), I have a SELECT * from myTable WHERE locations_$homeAddress = '123 Elm St' It's telling me $ is invalid. Is this a bug?

Re: SparkSQL doesn't seem to like $'s in column names

2015-02-13 Thread Corey Nolet
This doesn't seem to have helped. On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust mich...@databricks.com wrote: Try using `backticks` to escape non-standard characters. On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote: I don't remember Oracle ever enforcing that I

Re: Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Nevermind- I think I may have had a schema-related issue (sometimes booleans were represented as strings and sometimes as raw booleans, but when I populated the schema one or the other was chosen). On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet cjno...@gmail.com wrote: Here are the results

Boolean values as predicates in SQL string

2015-02-13 Thread Corey Nolet
Here are the results of a few different SQL strings (let's assume the schemas are valid for the data types used): SELECT * from myTable where key1 = true - no filters are pushed to my PrunedFilteredScan SELECT * from myTable where key1 = true and key2 = 5 - 1 filter (key2) is pushed to my

Re: Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I was able to get this working by extending KryoRegistrator and setting the spark.kryo.registrator property. On Thu, Feb 12, 2015 at 12:31 PM, Corey Nolet cjno...@gmail.com wrote: I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class
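
A sketch of what that solution looks like (MyEvent, MyEventSerializer, and the registrator's package are placeholders):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Register a custom Kryo Serializer for a specific class via a registrator.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyEvent], new MyEventSerializer)  // both names are placeholders
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyRegistrator")  // fully-qualified name is illustrative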

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
group should need to fit. On Wed, Feb 11, 2015 at 2:56 PM, Corey Nolet cjno...@gmail.com wrote: Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-12 Thread Corey Nolet
the data to a single partition (no matter what window I set) and it seems to lock up my jobs. I waited for 15 minutes for a stage that usually takes about 15 seconds and I finally just killed the job in yarn. On Thu, Feb 12, 2015 at 4:40 PM, Corey Nolet cjno...@gmail.com wrote: So I tried

Using Spark SQL for temporal data

2015-02-12 Thread Corey Nolet
I have a temporal data set in which I'd like to be able to query using Spark SQL. The dataset is actually in Accumulo and I've already written a CatalystScan implementation and RelationProvider[1] to register with the SQLContext so that I can apply my SQL statements. With my current

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?

Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
I think the word partition here is a tad different than the term partition that we use in Spark. Basically, I want something similar to Guava's Iterables.partition [1], that is, If I have an RDD[People] and I want to run an algorithm that can be optimized by working on 30 people at a time, I'd

Re: Easy way to partition an RDD into chunks like Guava's Iterables.partition

2015-02-11 Thread Corey Nolet
Doesn't iter still need to fit entirely into memory? On Wed, Feb 11, 2015 at 5:55 PM, Mark Hamstra m...@clearstorydata.com wrote: rdd.mapPartitions { iter => val grouped = iter.grouped(batchSize); for (group <- grouped) { ... } } On Wed, Feb 11, 2015 at 2:44 PM, Corey Nolet cjno

Re: Writable serialization from InputFormat losing fields

2015-02-10 Thread Corey Nolet
I am able to get around the problem by doing a map and getting the Event out of the EventWritable before I do my collect. I think I'll do that for now. On Tue, Feb 10, 2015 at 6:04 PM, Corey Nolet cjno...@gmail.com wrote: I am using an input format to load data from Accumulo [1] in to a Spark

Re: How to design a long live spark application

2015-02-05 Thread Corey Nolet
Here's another lightweight example of running a SparkContext in a common java servlet container: https://github.com/calrissian/spark-jetty-server On Thu, Feb 5, 2015 at 11:46 AM, Charles Feduke charles.fed...@gmail.com wrote: If you want to design something like Spark shell have a look at:

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
My mistake Marcelo, I was looking at the wrong message. That reply was meant for bo yang. On Feb 4, 2015 4:02 PM, Marcelo Vanzin van...@cloudera.com wrote: Hi Corey, On Wed, Feb 4, 2015 at 12:44 PM, Corey Nolet cjno...@gmail.com wrote: Another suggestion is to build Spark by yourself

Re: “mapreduce.job.user.classpath.first” for Spark

2015-02-04 Thread Corey Nolet
works for YARN). Also thread at http://apache-spark-user-list.1001560.n3.nabble.com/netty-on-classpath-when-using-spark-submit-td18030.html . HTH, Markus On 02/03/2015 11:20 PM, Corey Nolet wrote: I'm having a really bad dependency conflict right now with Guava versions between my Spark

“mapreduce.job.user.classpath.first” for Spark

2015-02-03 Thread Corey Nolet
I'm having a really bad dependency conflict right now with Guava versions between my Spark application in Yarn and (I believe) Hadoop's version. The problem is, my driver has the version of Guava which my application is expecting (15.0) while it appears the Spark executors that are working on my

Long pauses after writing to sequence files

2015-01-30 Thread Corey Nolet
We have a series of spark jobs which run in succession over various cached datasets, do small groups and transforms, and then call saveAsSequenceFile() on them. Each call to save as a sequence file appears to have done its work; the task says it completed in xxx.x seconds, but then it pauses

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed, Jan 28, 2015 at 9:07 AM, Corey Nolet cjno...@gmail.com wrote: In all of the solutions I've found thus far, sorting has been

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-28 Thread Corey Nolet
/scala/org/apache/spark/rdd/OrderedRDDFunctions.scala On Wed, Jan 28, 2015 at 9:16 AM, Corey Nolet cjno...@gmail.com wrote: I'm looking @ the ShuffledRDD code and it looks like there is a method setKeyOrdering()- is this guaranteed to order everything in the partition? I'm on Spark 1.2.0 On Wed

Spark 1.2.x Yarn Auxiliary Shuffle Service

2015-01-27 Thread Corey Nolet
I've read that this is supposed to be a rather significant optimization to the shuffle system in 1.1.0 but I'm not seeing much documentation on enabling this in Yarn. I see github classes for it in 1.2.0 and a property spark.shuffle.service.enabled in the spark-defaults.conf. The code mentions
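
For reference, a sketch of the two pieces usually involved in enabling it on YARN (property and class names as commonly documented; verify them against your Spark/Hadoop versions):

    # spark-defaults.conf
    spark.shuffle.service.enabled  true

    <!-- yarn-site.xml on each NodeManager -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>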

Re: Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
, Corey Nolet cjno...@gmail.com wrote: I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running

Partition + equivalent of MapReduce multiple outputs

2015-01-27 Thread Corey Nolet
I need to be able to take an input RDD[Map[String,Any]] and split it into several different RDDs based on some partitionable piece of the key (groups) and then send each partition to a separate set of files in different folders in HDFS. 1) Would running the RDD through a custom partitioner be the
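
One simple (if not maximally efficient) sketch of the goal as stated, hedged: the group key, base path, and output format are placeholders, and a custom partitioner plus a MultipleOutputs-style OutputFormat would avoid the repeated passes.

    // Write each group to its own HDFS folder by filtering the cached
    // source once per group.
    val source = rawRdd.map(m => (m("group").toString, m)).cache()
    val groups = source.keys.distinct().collect()

    groups.foreach { g =>
      source.filter(_._1 == g)
            .values
            .saveAsObjectFile(s"hdfs:///out/$g")  // output path is illustrative
    }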

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
, Jan 17, 2015 at 4:29 PM, Michael Armbrust mich...@databricks.com wrote: How are you running your test here? Are you perhaps doing a .count()? On Sat, Jan 17, 2015 at 12:54 PM, Corey Nolet cjno...@gmail.com wrote: Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
Michael, What I'm seeing (in Spark 1.2.0) is that the required columns being pushed down to the DataRelation are not the product of the SELECT clause but rather just the columns explicitly included in the WHERE clause. Examples from my testing: SELECT * FROM myTable -- The required columns are

Re: Spark SQL Custom Predicate Pushdown

2015-01-17 Thread Corey Nolet
an example [1] of what I'm trying to accomplish. [1] https://github.com/calrissian/accumulo-recipes/blob/273/thirdparty/spark/src/main/scala/org/calrissian/accumulorecipes/spark/sql/EventStore.scala#L49 On Fri, Jan 16, 2015 at 10:17 PM, Corey Nolet cjno...@gmail.com wrote: Hao, Thanks so much

Re: Creating Apache Spark-powered “As Service” applications

2015-01-16 Thread Corey Nolet
There's also an example of running a SparkContext in a java servlet container from Calrissian: https://github.com/calrissian/spark-jetty-server On Fri, Jan 16, 2015 at 2:31 PM, olegshirokikh o...@solver.com wrote: The question is about the ways to create a Windows desktop-based and/or

Re: Spark SQL Custom Predicate Pushdown

2015-01-16 Thread Corey Nolet
Down: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala Examples also can be found in the unit test: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/sources *From:* Corey Nolet
