All,
What is the largest input data set y'all have come across that has been
successfully processed in production using Spark? Ballpark?
If I may add my contribution to this discussion, if I understand your
question correctly...
A DStream is a discretized stream: it discretizes the data stream over windows
of time (according to the project code and paper I've read). So when
you write:
JavaStreamingContext stcObj = new
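To make that concrete, here is a minimal sketch (in Scala, with a made-up 1-second batch interval and a placeholder socket source) of how the incoming stream gets chopped into one RDD per batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Each batch interval (here 1 second) becomes one RDD in the DStream.
val conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

// Lines received on the socket are grouped into one RDD per 1-second batch.
val lines = ssc.socketTextStream("localhost", 9999) // host/port are placeholders
lines.foreachRDD(rdd => println("Batch contains " + rdd.count() + " lines"))

ssc.start()
ssc.awaitTermination()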
@TD: I do not need multiple RDDs in a DStream in every batch. On the contrary,
my logic would work fine if there is only one RDD. But then the description for
functions like reduce and count (Return a new DStream of single-element RDDs by
counting the number of elements in each RDD of the source
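For context, a small sketch (assuming an existing DStream named events, which is made up here) of what that count() description means in practice: each batch yields a single-element RDD holding that batch's record count.

// `events` is assumed to be an existing DStream[String].
// count() returns a DStream[Long] whose every batch is a single-element RDD.
val counts = events.count()
counts.foreachRDD { rdd =>
  // rdd holds exactly one element: the number of records in this batch
  rdd.collect().foreach(c => println("Records in this batch: " + c))
}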
VerifyError: class
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$CreateSnapshotRequestProto
overrides final method getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
What could be the cause of this error?
Regards,
Laeeq Ahmed,
PhD Student,
HPCViz, KTH.
This is what the web UI looks like:
[inline screenshot of the web UI omitted]
I also tailed all the worker logs, and these are the last entries before the
waiting begins:
14/03/20 13:29:10 INFO BlockFetcherIterator$BasicBlockFetcherIterator:
maxBytesInFlight: 50331648, minRequest: 10066329
14/03/20 13:29:10 INFO
Why are you trying to reduceByKey? Are you looking to work on the data
sequentially?
If I understand correctly, you are looking to filter your data using the
Bloom filter, where each Bloom filter is tied to the key that instantiated it.
Following are some of the options:
*partition* your data by key
That's expected. I think sortByKey is an option too, probably a better one.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Mar 20, 2014 at 3:20 PM, Ameet Kini ameetk...@gmail.com wrote:
val rdd2 = rdd.partitionBy(my
I saw that but I don't need a global sort, only intra-partition sort.
Ameet
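A minimal sketch of the intra-partition sort being discussed (partitioner, partition count, and data are made up; this is not Ameet's actual code):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("IntraPartitionSort").setMaster("local[2]"))
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("b", 1), ("a", 3)))

// Partition by key, then sort each partition independently (no global sort).
val partitioned = pairs.partitionBy(new HashPartitioner(4))
val sortedWithinPartitions = partitioned.mapPartitions(
  iter => iter.toArray.sortBy(_._1).iterator,
  preservesPartitioning = true)

sortedWithinPartitions.foreachPartition(p => println(p.mkString(", ")))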
On Thu, Mar 20, 2014 at 3:26 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
That's expected. I think sortByKey is an option too, probably a better one.
Mayur Rustagi
Ph: +1 (760) 203 3257
Would love to .. but I am in NY till Friday :(
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Wed, Mar 19, 2014 at 7:34 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I'm in San Francisco until Friday for a
Mayur,
To be a little clearer: for creating the Bloom filters, I don't think
broadcast variables are the way to go, though that would definitely work
for using the Bloom filters to filter data.
The reason is that the creation needs to happen in a single thread.
Otherwise, some type of
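For the "use the Bloom filters to filter data" half, a hedged sketch of the broadcast approach (a plain Set stands in for a real Bloom filter implementation, and the data is made up):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BroadcastFilterSketch").setMaster("local[2]"))

// Built once on the driver (single thread), then shipped read-only to executors.
val approxMembers: Set[String] = Set("id-1", "id-7", "id-42") // placeholder contents
val bcFilter = sc.broadcast(approxMembers)

val records = sc.parallelize(Seq("id-1", "id-2", "id-42"))
val kept = records.filter(r => bcFilter.value.contains(r))
kept.collect().foreach(println)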
Thank you, Ewen. RDD.pipe is what I need and it works like a charm. On the
other hand, RDD.mapPartitions seems interesting, but I can't figure out
how to make it work.
Jaonary
On Thu, Mar 20, 2014 at 4:54 PM, Ewen Cheslack-Postava m...@ewencp.org wrote:
Take a look at RDD.pipe().
You
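A small sketch contrasting the two approaches mentioned above (the external command and the sample data are placeholders, not Jaonary's actual pipeline):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("PipeVsMapPartitions").setMaster("local[2]"))
val lines = sc.parallelize(Seq("a", "b", "c"))

// RDD.pipe: each partition's elements go to the external process's stdin,
// one per line, and its stdout lines become the output RDD.
val piped = lines.pipe(Seq("tr", "a-z", "A-Z"))

// RDD.mapPartitions: the function receives an Iterator over a whole partition
// and must return an Iterator with that partition's output.
val mapped = lines.mapPartitions(iter => iter.map(_.toUpperCase))

println(piped.collect().mkString(","))
println(mapped.collect().mkString(","))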
Hi Reynold,
Nice! What Spark configuration parameters did you use to get your job to
run successfully on a large dataset? My job is failing on 1TB of input data
(uncompressed) on a 4-node cluster (64GB memory per node). No OutOfMemory
errors, just lost executors.
Thanks,
Soila
On Mar 20, 2014
Thanks TD, happy to share my experience with MLLib + Spark Streaming
integration.
Here's a gist with two examples I have working, one for
StreamingLinearRegression and another for StreamingKMeans.
https://gist.github.com/freeman-lab/9672685
The goal in each case was to implement a streaming
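For readers who haven't opened the gist, here is a hedged sketch of the general pattern only (not the gist's code): keep model state on the driver and update it once per batch via foreachRDD, shown here with a toy running mean over a placeholder socket source.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("StreamingModelSketch").setMaster("local[2]"), Seconds(1))
val values = ssc.socketTextStream("localhost", 9999).map(_.toDouble) // placeholder source

// Toy "model": a running mean, updated incrementally from each batch.
var count = 0L
var mean = 0.0
values.foreachRDD { rdd =>
  val n = rdd.count()
  if (n > 0) {
    mean = (mean * count + rdd.sum()) / (count + n)
    count += n
    println("Updated running mean after " + count + " points: " + mean)
  }
}

ssc.start()
ssc.awaitTermination()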
Grouped by the group_id but not sorted.
-Suren
On Thu, Mar 20, 2014 at 5:52 PM, Mayur Rustagi mayur.rust...@gmail.com wrote:
You are using the data grouped (sorted?) to create the Bloom filter?
On Mar 20, 2014 4:35 PM, Surendranauth Hiraman suren.hira...@velos.io
wrote:
Mayur,
To be a
Jim, I'm starting to document the heap size settings all in one place,
since they have been a source of confusion for a lot of my peers. Maybe you
can take a look at this ticket?
https://spark-project.atlassian.net/browse/SPARK-1264
On Wed, Mar 19, 2014 at 12:53 AM, Jim Blomo jim.bl...@gmail.com wrote:
To
Yeah, this is definitely confusing. The motivation for this was that different
users of the same cluster may want to set different memory sizes for their
apps, so we decided to put this setting in the driver. However, if you put
SPARK_JAVA_OPTS in spark-env.sh, it also applies to executors,
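A small sketch of the per-application alternative being described (the value is a placeholder): set memory from the driver program via SparkConf instead of globally in spark-env.sh.

import org.apache.spark.{SparkConf, SparkContext}

// Per-app settings made in the driver apply only to this application,
// unlike SPARK_JAVA_OPTS in spark-env.sh, which applies to every app.
val conf = new SparkConf()
  .setAppName("PerAppMemorySketch")
  .set("spark.executor.memory", "4g") // placeholder value
val sc = new SparkContext(conf)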
Hi Adrian,
On every timestep of execution, we receive new data, then report updated word
counts for that new data plus the past 30 seconds. The latency here is about
how quickly you get these updated counts once the new batch of data comes in.
It’s true that the count reflects some data from
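A hedged sketch of the windowed word count being described (batch interval, window length, and source are placeholder values): every 5-second batch reports counts covering the last 30 seconds of data.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

val ssc = new StreamingContext(
  new SparkConf().setAppName("WindowedCounts").setMaster("local[2]"), Seconds(5))
val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Every 5s, emit (word, count) pairs computed over the last 30s of data.
val windowedCounts = words.map(w => (w, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))
windowedCounts.print()

ssc.start()
ssc.awaitTermination()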
Hi,
I have run a Spark application to process input data of size ~14GB with
10GB of executor memory. The job got stuck with the message below:
14/03/21 05:02:07 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(0, guavus-0102bf, 49347, 0) with no recent heart
beats: 85563ms