sample is not a member of org.apache.spark.streaming.dstream.DStream

2014-12-28 Thread Josh J
).map(_._2) streamtoread.sample(withReplacement = true, fraction = fraction) How do I use the sample() method (http://spark.apache.org/docs/latest/programming-guide.html#transformations) with Spark Streaming? Thanks, Josh
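The compile error happens because sample is defined on RDD, not on DStream; the usual fix in Spark Streaming is to apply it per micro-batch via `dstream.transform(rdd => rdd.sample(...))`. Below is a plain-Python sketch of per-batch Bernoulli sampling (the `withReplacement = false` case), with lists standing in for RDDs; names and data are illustrative only.

```python
import random

def sample_batch(batch, fraction, seed=None):
    """Bernoulli sampling of one micro-batch, mirroring
    rdd.sample(withReplacement=False, fraction) applied inside transform."""
    rng = random.Random(seed)
    return [x for x in batch if rng.random() < fraction]

# Each list stands in for the RDD of one batch interval.
batches = [[1, 2, 3, 4], [5, 6, 7, 8]]
sampled = [sample_batch(b, fraction=0.5, seed=42) for b in batches]
```

In real code the same idea reads `streamtoread.transform(rdd => rdd.sample(withReplacement, fraction))`, which returns a new DStream of sampled RDDs.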

Re: action progress in ipython notebook?

2014-12-27 Thread Josh Rosen
/ stage / task progress information, as well as expanding the types of information exposed through the stable status API interface. - Josh On Thu, Dec 25, 2014 at 10:01 AM, Eric Friedman eric.d.fried...@gmail.com wrote: Spark 1.2.0 is SO much more usable than previous releases -- many thanks

Re: Discourse: A proposed alternative to the Spark User list

2014-12-25 Thread Josh Rosen
a bit of additional context in the meantime. - Josh On Thu, Dec 25, 2014 at 5:36 PM, Tobias Pfeiffer t...@preferred.jp wrote: Nick, uh, I would have expected a rather heated discussion, but the opposite seems to be the case ;-) Independent of my personal preferences w.r.t. usability, habits etc

Re: Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-17 Thread Josh Rosen
will be sent to both spark.incubator.apache.org and spark.apache.org (if that is the case, i'm not sure which alias nabble posts get sent to) would make things a lot more clear. On Sat, Dec 13, 2014 at 5:05 PM, Josh Rosen rosenvi...@gmail.com wrote: I've noticed that several users are attempting to post

Nabble mailing list mirror errors: This post has NOT been accepted by the mailing list yet

2014-12-13 Thread Josh Rosen
by the mailing list. I wanted to mention this issue to the Spark community to see whether there are any good solutions to address this. I have spoken to users who think that our mailing list is unresponsive / inactive because their un-posted messages haven't received any replies. - Josh

Re: java.io.InvalidClassException: org.apache.spark.api.java.JavaUtils$SerializableMapWrapper; no valid constructor

2014-12-01 Thread Josh Rosen
SerializableMapWrapper was added in https://issues.apache.org/jira/browse/SPARK-3926; do you mind opening a new JIRA and linking it to that one? On Mon, Dec 1, 2014 at 12:17 AM, lokeshkumar lok...@dataken.net wrote: The workaround was to wrap the map returned by spark libraries into HashMap

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
can maintain exactly once semantics when writing to topic 2? Thanks, Josh

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly once semantics for the write to Kafka? On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith secs...@gmail.com wrote: I'd be interested in finding the answer too. Right now, I do: val kafkaOutMsgs = kafkInMessages.map(x=myFunc(x._2,someParam))

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-24 Thread Josh Mahonin
also do a lot more with it than just the Phoenix functions provide. I don't know if this works with PySpark or not, but assuming the 'newHadoopRDD' functionality works for other input formats, it should work for Phoenix as well. Josh On Fri, Nov 21, 2014 at 5:12 PM, Alaa Ali contact.a...@gmail.com

Re: Spark SQL with Apache Phoenix lower and upper Bound

2014-11-21 Thread Josh Mahonin
: https://github.com/simplymeasured/phoenix-spark Josh On Fri, Nov 21, 2014 at 4:14 PM, Alaa Ali contact.a...@gmail.com wrote: I want to run queries on Apache Phoenix which has a JDBC driver. The query that I want to run is: select ts,ename from random_data_date limit 10 But I'm having issues

Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Hi, I was wondering if the adaptive stream processing and dynamic batch processing was available to use in spark streaming? If someone could help point me in the right direction? Thanks, Josh

Re: Adaptive stream processing and dynamic batch sizing

2014-11-14 Thread Josh J
Referring to this paper http://dl.acm.org/citation.cfm?id=2670995. On Fri, Nov 14, 2014 at 10:42 AM, Josh J joshjd...@gmail.com wrote: Hi, I was wondering if the adaptive stream processing and dynamic batch processing was available to use in spark streaming? If someone could help point me

concat two DStreams

2014-11-11 Thread Josh J
Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined DStream. Thanks, Josh

Re: concat two DStreams

2014-11-11 Thread Josh J
I think it's just called union On Tue, Nov 11, 2014 at 2:41 PM, Josh J joshjd...@gmail.com wrote: Hi, Is it possible to concatenate or append two DStreams together? I have an incoming stream that I wish to combine with data that's generated by a utility. I then need to process the combined
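As the reply says, the operation is union: `stream1.union(stream2)` (or `ssc.union(...)` for several streams) merges the corresponding RDDs of each batch interval. A plain-Python sketch, with lists standing in for per-batch RDDs:

```python
# Stand-in for DStream.union: for each batch interval, the combined stream
# contains the elements of both input streams' RDDs for that interval.
def union_batches(stream_a, stream_b):
    return [a + b for a, b in zip(stream_a, stream_b)]

incoming = [[1, 2], [3]]       # batches from the live source
generated = [[10], [20, 30]]   # batches produced by the utility
combined = union_batches(incoming, generated)
# combined: [[1, 2, 10], [3, 20, 30]]
```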

convert List&lt;String&gt; to DStream

2014-11-10 Thread Josh J
) and found : java.util.LinkedList[org.apache.spark.rdd.RDD[String]] required: scala.collection.mutable.Queue[org.apache.spark.rdd.RDD[?]] Thanks, Josh

Re: scala RDD sortby compilation error

2014-11-04 Thread Josh J
: Ordering[K], implicit ctag: scala.reflect.ClassTag[K])org.apache.spark.rdd.RDD[String]. Unspecified value parameter f. On Tue, Nov 4, 2014 at 11:28 AM, Josh J joshjd...@gmail.com wrote: Hi, Does anyone have any good examples of using sortby for RDDs and scala? I'm receiving not enough
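"Unspecified value parameter f" means the key-extraction function was never passed: `rdd.sortBy(f)` sorts records by the key `f` pulls out of each one. The plain-Python analog is `sorted(key=...)`; the CSV-ish records below are invented for illustration:

```python
# rdd.sortBy(f) sorts by the key f extracts from each record, which is the
# missing parameter the compiler error complains about. sorted(key=...) is
# the plain-Python equivalent.
records = ["banana,3", "apple,10", "cherry,2"]

by_name = sorted(records, key=lambda line: line.split(",")[0])
by_count = sorted(records, key=lambda line: int(line.split(",")[1]))
# by_count: ['cherry,2', 'banana,3', 'apple,10']
```

In Scala the equivalent call would be something like `rdd.sortBy(line => line.split(",")(1).toInt)`.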

random shuffle streaming RDDs?

2014-11-03 Thread Josh J
Hi, Is there a nice or optimal method to randomly shuffle spark streaming RDDs? Thanks, Josh

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
? in general RDDs don't have ordering at all -- excepting when you sort for example -- so a permutation doesn't make sense. Do you just want a well-defined but random ordering of the data? Do you just want to (re-)assign elements randomly to partitions? On Mon, Nov 3, 2014 at 4:33 PM, Josh J joshjd

Re: random shuffle streaming RDDs?

2014-11-03 Thread Josh J
is guaranteed about that. If you want to permute an RDD, how about a sortBy() on a good hash function of each value plus some salt? (Haven't thought this through much but sounds about right.) On Mon, Nov 3, 2014 at 4:59 PM, Josh J joshjd...@gmail.com wrote: When I'm outputting the RDDs
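The suggestion in the thread — permute an RDD by `sortBy` on a good hash of each value plus some salt — can be sketched in plain Python. `hashlib` stands in for the hash function; varying the salt varies the permutation. This is a sketch of the idea, not Spark API:

```python
import hashlib

def salted_shuffle(data, salt):
    """Permute data by sorting on a hash of (salt, value) -- the sortBy
    trick from the thread, with a plain list standing in for the RDD."""
    def key(value):
        return hashlib.sha256(f"{salt}:{value}".encode()).hexdigest()
    return sorted(data, key=key)

data = list(range(10))
perm_a = salted_shuffle(data, salt="a")  # one pseudorandom ordering
perm_b = salted_shuffle(data, salt="b")  # a different salt, different ordering
```

Because the hash is deterministic, the same salt always yields the same permutation, which can be useful for reproducibility.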

Deleting temp dir Exception

2014-11-03 Thread Josh
Hi, I've written a short scala app to perform word counts on a text file and am getting the following exception as the program completes (after it prints out all of the word counts). Exception in thread delete Spark temp dir C:\Users\Josh\AppData\Local\Temp\spark-0fdd0b79-7329-4690-a093

run multiple spark applications in parallel

2014-10-28 Thread Josh J
Hi, How do I run multiple spark applications in parallel? I tried to run on yarn cluster, though the second application submitted does not run. Thanks, Josh

Re: run multiple spark applications in parallel

2014-10-28 Thread Josh J
, Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz 32 GB RAM Thanks, Josh On Tue, Oct 28, 2014 at 4:15 PM, Soumya Simanta soumya.sima...@gmail.com wrote: Try reducing the resources (cores and memory) of each application. On Oct 28, 2014, at 7:05 PM, Josh J joshjd...@gmail.com wrote: Hi, How

exact count using rdd.count()?

2014-10-27 Thread Josh J
than once in the event of a worker failure. http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node Thanks, Josh

combine rdds?

2014-10-27 Thread Josh J
Hi, How could I combine rdds? I would like to combine two RDDs if the count in an RDD is not above some threshold. Thanks, Josh

docker spark 1.1.0 cluster

2014-10-24 Thread Josh J
Hi, Is there a dockerfiles available which allow to setup a docker spark 1.1.0 cluster? Thanks, Josh

streaming join sliding windows

2014-10-22 Thread Josh J
Hi, How can I join neighbor sliding windows in spark streaming? Thanks, Josh

Re: small bug in pyspark

2014-10-12 Thread Josh Rosen
drivers and workers. - Josh On Fri, Oct 10, 2014 at 5:24 PM, Andy Davidson a...@santacruzintegration.com wrote: Hi I am running spark on an ec2 cluster. I need to update python to 2.7. I have been following the directions on http://nbviewer.ipython.org/gist/JoshRosen/6856670 https

Re: What if I port Spark from TCP/IP to RDMA?

2014-10-12 Thread Josh Rosen
Hi Theo, Check out *spark-perf*, a suite of performance benchmarks for Spark: https://github.com/databricks/spark-perf. - Josh On Fri, Oct 10, 2014 at 7:27 PM, Theodore Si sjyz...@gmail.com wrote: Hi, Let's say that I managed to port Spark from TCP/IP to RDMA. What tool or benchmark can I

Re: pyspark on python 3

2014-10-03 Thread Josh Rosen
/2144 - Josh On Fri, Oct 3, 2014 at 6:44 PM, tomo cocoa cocoatom...@gmail.com wrote: Hi, I prefer that PySpark can also be executed on Python 3. Do you have some reason or demand to use PySpark through Python3? If you create an issue on JIRA, I would try to resolve it. On 4 October

Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using ~/ephemeral-hdfs/sbin/start-mapred.sh. On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote: Hi, I would like to copy log files from s3 to the cluster's ephemeral-hdfs. I tried to use distcp, but I guess

Re: countByWindow save the count ?

2014-08-26 Thread Josh J
of countByWindow with a function that performs the save operation. On Fri, Aug 22, 2014 at 1:58 AM, Josh J joshjd...@gmail.com wrote: Hi, Hopefully a simple question. Though is there an example of where to save the output of countByWindow ? I would like to save the results to external storage (kafka
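The reply's suggestion — follow countByWindow with a function that performs the save — corresponds to `foreachRDD` in Spark Streaming. A plain-Python sketch of the pattern, with lists as batches and a list as the external sink (in a real job the sink would be a Kafka producer or Redis client):

```python
# Simulated countByWindow followed by a per-window save step (foreachRDD
# in real Spark Streaming). Batches and the sink are stand-ins.
def count_by_window(batches, window_len, slide):
    counts = []
    for start in range(0, len(batches) - window_len + 1, slide):
        window = batches[start:start + window_len]
        counts.append(sum(len(b) for b in window))
    return counts

sink = []  # stand-in for external storage (Kafka topic, Redis key, ...)
batches = [[1, 2], [3], [4, 5, 6], []]
for count in count_by_window(batches, window_len=2, slide=1):
    sink.append(count)  # the "save" performed once per window
# sink: [3, 4, 3]
```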

countByWindow save the count ?

2014-08-22 Thread Josh J
Hi, Hopefully a simple question. Though is there an example of where to save the output of countByWindow ? I would like to save the results to external storage (kafka or redis). The examples show only stream.print() Thanks, Josh

multiple windows from the same DStream ?

2014-08-21 Thread Josh J
windowMessages1 = messages.window(windowLength, slideInterval); JavaPairDStream&lt;String,String&gt; windowMessages2 = messages.window(windowLength, slideInterval); Thanks, Josh
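Calling window() twice on the same DStream does produce two independent windowed views, as the Java snippet attempts. The idea, sketched in plain Python with lists standing in for per-interval RDDs and with made-up window parameters:

```python
# Two independent windowed views over the same batch sequence, mirroring
# two messages.window(windowLength, slideInterval) calls on one DStream.
def window(batches, length, slide):
    return [sum(batches[i:i + length], [])
            for i in range(0, len(batches) - length + 1, slide)]

messages = [["a"], ["b", "c"], ["d"], ["e"]]
windows_short = window(messages, length=2, slide=1)
windows_long = window(messages, length=3, slide=1)
```

Each call walks the stream separately, so the two views can use different lengths and slide intervals without interfering.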

DStream start a separate DStream

2014-08-21 Thread Josh J
DStream. How can I accomplish this with spark? Sincerely, Josh

Difference between amplab docker and spark docker?

2014-08-20 Thread Josh J
Hi, Whats the difference between amplab docker https://github.com/amplab/docker-scripts and spark docker https://github.com/apache/spark/tree/master/docker? Thanks, Josh

Re: Question on mappartitionwithsplit

2014-08-17 Thread Josh Rosen
Has anyone tried using functools.partial ( https://docs.python.org/2/library/functools.html#functools.partial) with PySpark? If it works, it might be a nice way to address this use-case. On Sun, Aug 17, 2014 at 7:35 PM, Davies Liu dav...@databricks.com wrote: On Sun, Aug 17, 2014 at 11:21 AM,
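functools.partial binds some arguments of a function up front, returning a callable that takes only the remaining ones — which is why it fits the mapPartitionsWithSplit use-case, where the framework supplies the split index and iterator but the user wants to thread extra parameters through. A minimal standalone example (plain map instead of an RDD):

```python
from functools import partial

def scale(factor, x):
    """A two-argument function we want to use where a one-argument
    function is expected."""
    return factor * x

# partial fixes factor=2, leaving a one-argument callable.
double = partial(scale, 2)
result = list(map(double, [1, 2, 3]))
# result: [2, 4, 6]
```

Because partial objects are picklable when their underlying function is importable, they generally survive the serialization PySpark performs when shipping closures to workers.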

Re: Data from Mysql using JdbcRDD

2014-07-30 Thread Josh Mahonin
, upper bound index, and number of partitions. With that example query and those values, you should end up with an RDD with two partitions, one with the student_info from 1 through 10, and the second with ids 11 through 20. Josh On Wed, Jul 30, 2014 at 6:58 PM, chaitu reddy chaitzre...@gmail.com
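The partitioning described — lower bound 1, upper bound 20, two partitions yielding ids 1–10 and 11–20 — comes from JdbcRDD carving the closed interval into contiguous ranges. A sketch of that arithmetic (my reconstruction, not the actual Spark source):

```python
# Sketch of how a [lower, upper] id range is split into numPartitions
# contiguous sub-ranges, reproducing the thread's example of
# (1, 20, 2) -> partitions covering 1-10 and 11-20.
def jdbc_partitions(lower, upper, num_partitions):
    length = upper - lower + 1
    return [
        (lower + (i * length) // num_partitions,
         lower + ((i + 1) * length) // num_partitions - 1)
        for i in range(num_partitions)
    ]

bounds = jdbc_partitions(1, 20, 2)
# bounds: [(1, 10), (11, 20)]
```

Each (start, end) pair is substituted into the `?` placeholders of the bound query, so every partition fetches a disjoint slice of rows.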

Re: Broadcasting a set in PySpark

2014-07-18 Thread Josh Rosen
You have to use `myBroadcastVariable.value` to access the broadcasted value; see https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables On Fri, Jul 18, 2014 at 2:56 PM, Vedant Dhandhania ved...@retentionscience.com wrote: Hi All, I am trying to broadcast a set in a
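The diagnosis is that a broadcast variable is a wrapper object, not the broadcasted collection itself, so membership tests must go through `.value`. A minimal stand-in class makes the pattern concrete (this stub only mimics the access pattern, not PySpark's actual distribution machinery):

```python
# Minimal stand-in for PySpark's Broadcast wrapper: the broadcast object
# is not the set itself, so lookups must dereference .value.
class Broadcast:
    def __init__(self, value):
        self.value = value

lookup = Broadcast({"a", "b", "c"})  # in PySpark: sc.broadcast({...})

hits = [x for x in ["a", "x", "c"] if x in lookup.value]  # correct: .value
# hits: ['a', 'c']
```

Writing `x in lookup` instead of `x in lookup.value` is the mistake the thread describes.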

Spark 1.0.1 EC2 - Launching Applications

2014-07-14 Thread Josh Happoldt
submit jobs to the cluster either. Thanks! Josh

Re: Working with Avro Generic Records in the interactive scala shell

2014-05-24 Thread Josh Marcus
Jeremy, Just to be clear, are you assembling a jar with that class compiled (with its dependencies) and including the path to that jar on the command line in an environment variable (e.g. SPARK_CLASSPATH=path ./spark-shell)? --j On Saturday, May 24, 2014, Jeremy Lewi jer...@lewi.us wrote: Hi

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
...@gmail.com wrote: Which version is this with? I haven’t seen standalone masters lose workers. Is there other stuff on the machines that’s killing them, or what errors do you see? Matei On May 16, 2014, at 9:53 AM, Josh Marcus jmar...@meetup.com

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
, 2014 at 3:28 PM, Josh Marcus jmar...@meetup.com wrote: We're using spark 0.9.0, and we're using it out of the box -- not using Cloudera Manager or anything similar. There are warnings from the master that there continue to be heartbeats from the unregistered workers. I will see

Re: advice on maintaining a production spark cluster?

2014-05-20 Thread Josh Marcus
and messages semi-regularly on CDH5 + 0.9.0. I don't have any insight into when it happens, but usually after heavy use and after running for a long time. I had figured I'd see if the changes since 0.9.0 addressed it and revisit later. On Tue, May 20, 2014 at 8:37 PM, Josh Marcus jmar

advice on maintaining a production spark cluster?

2014-05-16 Thread Josh Marcus
Hey folks, I'm wondering what strategies other folks are using for maintaining and monitoring the stability of stand-alone spark clusters. Our master very regularly loses workers, and they (as expected) never rejoin the cluster. This is the same behavior I've seen using akka cluster (if that's

Re: Spark and HBase

2014-04-25 Thread Josh Mahonin
or not though, so if anyone else is looking into this, I'd love to hear their thoughts. Josh On Tue, Apr 8, 2014 at 1:00 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Just took a quick look at the overview here (http://phoenix.incubator.apache.org/) and the quick start guide here (http

Re: flatten RDD[RDD[T]]

2014-03-02 Thread Josh Rosen
Nope, nested RDDs aren't supported: https://groups.google.com/d/msg/spark-users/_Efj40upvx4/DbHCixW7W7kJ https://groups.google.com/d/msg/spark-users/KC1UJEmUeg8/N_qkTJ3nnxMJ https://groups.google.com/d/msg/spark-users/rkVPXAiCiBk/CORV5jyeZpAJ On Sun, Mar 2, 2014 at 5:37 PM, Cosmin Radoi
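Since nested RDDs are unsupported, an RDD[RDD[T]] has to be restructured on the driver side — typically by holding a plain collection of RDDs and combining them with `sc.union`. Sketched with lists standing in for RDDs:

```python
from itertools import chain

# A driver-side Seq of RDDs (lists here) flattened into one dataset,
# analogous to sc.union(rdds) in Spark -- the supported alternative to
# nesting RDDs inside an RDD.
rdds = [[1, 2], [3], [4, 5]]
flattened = list(chain.from_iterable(rdds))
# flattened: [1, 2, 3, 4, 5]
```

In Scala the equivalent would be `sc.union(rddSeq)` or `rddSeq.reduce(_ union _)`.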
