AFAIK Spark Streaming cannot work in a way like this. Transformations are
made on DStreams, where DStreams basically hold (time,
allocatedBlocksForBatch) pairs. Allocated blocks are allocated by the
JobGenerator; unallocated blocks (infos) are collected by the
ReceivedBlockTracker. In Spark
there is a BlockGenerator on each worker node next to the
ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer at
each interval (block_interval). These Blocks are passed to the
ReceiverSupervisorImpl, which puts them into the BlockManager
for storage. BlockInfos are passed back to the driver's ReceivedBlockTracker.
The block interval is configurable, so I think you could reduce it to
keep each block in memory only for a limited interval?
Is that what you are looking for?
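For example, something along these lines (an untested sketch; the app name
and intervals are made up, and in Spark 1.x the value is plain milliseconds,
default 200):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// A smaller block interval makes the BlockGenerator cut blocks more often,
// so each Block covers less data and less time.
val conf = new SparkConf()
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "50")
val ssc = new StreamingContext(conf, Seconds(2)) // 2s batches -> ~40 blocks each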
On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang wbi...@gmail.com wrote:
Hi,
I'm learning Spark and I find there could
I would like to use objectFile with some tweaks to the Hadoop conf.
Currently there is no way to do that, except recreating objectFile myself,
and some of the code objectFile uses I have no access to, since it's private
to Spark.
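For what it's worth, a rough sketch of that recreation, using plain Java
serialization in place of Spark's private Utils.deserialize helper (the
method name is mine, and this assumes the default Java serializer was used
to write the file):

import java.io.{ByteArrayInputStream, ObjectInputStream}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, SequenceFileInputFormat}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// An objectFile is a sequence file of serialized object batches, so a custom
// JobConf can be threaded through hadoopRDD and the values deserialized by hand.
def objectFileWithConf[T: ClassTag](sc: SparkContext, path: String, jobConf: JobConf): RDD[T] = {
  FileInputFormat.setInputPaths(jobConf, path)
  sc.hadoopRDD(jobConf,
      classOf[SequenceFileInputFormat[NullWritable, BytesWritable]],
      classOf[NullWritable], classOf[BytesWritable])
    .flatMap { case (_, bytes) =>
      val in = new ObjectInputStream(
        new ByteArrayInputStream(bytes.getBytes, 0, bytes.getLength))
      in.readObject().asInstanceOf[Array[T]] // saveAsObjectFile writes batches
    }
}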
On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com wrote:
Just a note: one challenge with the BYOH version might be that users who
download it can't run in local mode without also having Hadoop. But if we
document it clearly then hopefully it's okay.
Matei
On Mar 24, 2015, at 3:05 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey All,
For
I think this would be a great addition; I totally agree that you need to be
able to set these at a finer granularity than just the SparkContext.
Just to play devil's advocate, though -- the alternative is for you to just
subclass HadoopRDD yourself, or make a totally new RDD, and then you could
expose
That's correct. What's the reason this information is needed?
-Sandy
On Tue, Mar 24, 2015 at 11:41 AM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:
Thank you for your response!
I guess the (Spark) AM, which gives the container lease to the NM (along with
the executor JAR and the command to run)
Hi all,
I just upgraded Spark from 1.2.1 to 1.3.0 and changed the import
sqlContext.createSchemaRDD to import sqlContext.implicits._ in my code.
(I scanned the programming guide, and it seems this is the only change I
need to make.) But compilation now fails with the following error:
[ERROR]
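(For reference, a minimal sketch of the 1.3 idiom; Person and the path are
placeholders. Note that sqlContext must be a val, since the import requires
a stable identifier, and the RDD-to-DataFrame conversion is now an explicit
toDF():)

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc) // must be a val for the import to compile
import sqlContext.implicits._       // replaces import sqlContext.createSchemaRDD

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF() // in 1.3 the conversion is explicit; toDF comes from the implicits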
Hi Kannan,
As far as I know, the shuffle id in ShuffleDependency is incremented, so even
if you run the same job twice, the shuffle dependency, and with it the shuffle
id, will be different. The shuffle file name, which is composed of
(shuffleId + mapId + reduceId), will therefore change, so there's no name
conflict even
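For concreteness, a standalone sketch mirroring how spark-core names shuffle
blocks (the real ShuffleBlockId class is internal to Spark):

// A fresh shuffleId per ShuffleDependency keeps names distinct across jobs.
case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) {
  def name: String = s"shuffle_${shuffleId}_${mapId}_${reduceId}"
}

ShuffleBlockId(0, 3, 7).name // "shuffle_0_3_7" -- first run
ShuffleBlockId(1, 3, 7).name // "shuffle_1_3_7" -- same job run again, new id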
Two other criteria that I use when deciding what to backport:
- Is it a regression from a previous minor release? I'm much more likely
to backport fixes in this case, as I'd love for most people to stay up to
date.
- How scary is the change? I think the primary goal is stability of the
Imran, on your point about reading multiple files together in a partition, is
it not simpler to copy the Hadoop conf and set per-RDD settings for min split
size to control the input size per partition, together
with something like CombineFileInputFormat?
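For instance, something like this (paths and sizes are invented; the split
settings live on a copied conf, so other RDDs are unaffected):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

// Copy the shared Hadoop conf and set per-RDD split limits on the copy only.
val perRddConf = new Configuration(sc.hadoopConfiguration)
// Generic min-split knob honored by the regular FileInputFormats:
perRddConf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
// CombineFileInputFormat packs small files up to this cap per split:
perRddConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)

val combined = sc.newAPIHadoopFile(
  "/data/many-small-files",
  classOf[CombineTextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  perRddConf)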
On Tue, Mar 24, 2015 at 5:28 PM,
My philosophy has been basically what you suggested, Sean. One thing
you didn't mention, though, is that if a bug fix seems complicated, I will
think very hard before back-porting it. This is because fixes can
introduce their own new bugs, in some cases worse than the original
issue. It's really bad to
Yeah - to Nick's point, I think the way to do this is to pass in a
custom conf when you create a Hadoop RDD (that's AFAIK why the conf
field is there). Is there anything you can't do with that feature?
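For example (untested; the buffer-size tweak and path are only illustrative):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Build a JobConf seeded from the shared conf; per-RDD tweaks stay local to it.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("io.file.buffer.size", "65536") // example per-RDD setting
FileInputFormat.setInputPaths(jobConf, "/data/input")

val lines = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text]).map(_._2.toString)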
On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath
nick.pentre...@gmail.com wrote:
Imran, on
Hey All,
For a while we've published binary packages with different Hadoop
clients pre-bundled. We currently have three interfaces to a Hadoop
cluster: (a) the HDFS client, (b) the YARN client, and (c) the Hive client.
Because (a) and (b) are supposed to be backwards-compatible
interfaces, my working
You can try to set it in spark-env.sh.
# - SPARK_LOG_DIR Where log files are stored. (Default:
${SPARK_HOME}/logs)
# - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp)
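For example (the directories are just placeholders):

# in conf/spark-env.sh
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark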
Thanks.
Zhan Zhang
On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal
Hi Zoltan,
If running on YARN, the YARN NodeManager starts executors. I don't think
there's a 100% precise way for a Spark executor to know how many
resources are allotted to it. It can come close by looking at the Spark
configuration options used to request it (spark.executor.memory and
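For example, a rough sketch from inside a task (these are the request-time
values, not a measurement of the actual YARN container; the truncated option
above is presumably spark.executor.cores):

import org.apache.spark.SparkEnv

// SparkEnv is available inside a running executor.
val execConf = SparkEnv.get.conf
val requestedMem   = execConf.get("spark.executor.memory", "1g")
val requestedCores = execConf.getInt("spark.executor.cores", 1)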
I am working on SPARK-1529. I ran into an issue with my change, where the
same shuffle file was being reused across 2 jobs. Please note this only
happens when I use a hard-coded location for shuffle files, say
/tmp. It does not happen with the normal code path that uses DiskBlockManager
to pick
Hi,
I'm learning Spark, and I think there could be some optimization of the
current streaming implementation. Correct me if I'm wrong.
The current streaming implementation puts the data of one batch into memory
(as an RDD), but that seems unnecessary.
For example, if I want to count the lines which
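(The example is cut off above, but presumably something like a per-batch
filtered count, e.g.; host, port, and the predicate are invented for
illustration:)

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Count matching lines in each batch; a running count like this does not
// need the whole batch kept around once each line has been inspected.
val ssc = new StreamingContext(sc, Seconds(2))
ssc.socketTextStream("localhost", 9999)
  .filter(_.contains("Spark"))
  .count()
  .print()
ssc.start()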