Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
AFAIK Spark Streaming cannot work in a way like this. Transformations are made on DStreams, where DStreams basically hold (time, allocatedBlocksForBatch) pairs. Blocks are allocated by the JobGenerator; unallocated block infos are collected by the ReceivedBlockTracker. In Spark

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Zoltán Zvara
There is a BlockGenerator on each worker node, next to the ReceiverSupervisorImpl, which generates Blocks out of an ArrayBuffer on each interval (block_interval). These Blocks are passed to the ReceiverSupervisorImpl, which puts them into the BlockManager for storage. BlockInfos are passed
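
A back-of-the-envelope sketch (in Scala, with assumed interval values) of how the mechanism above plays out: the BlockGenerator cuts one block per block interval, so each batch consists of roughly batchInterval / blockInterval blocks per receiver, and each block becomes one partition (and so one task) of that batch's RDD.

    val batchIntervalMs = 2000L                              // assumed 2 s batches
    val blockIntervalMs = 200L                               // spark.streaming.blockInterval (1.x default, in ms)
    val blocksPerBatch  = batchIntervalMs / blockIntervalMs  // ~10 blocks => 10 partitions/tasks per receiver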

Re: Optimize the first map reduce of DStream

2015-03-24 Thread Arush Kharbanda
The block size is configurable, and that way I think you can reduce the block interval to keep the block in memory only for a limited interval? Is that what you are looking for? On Tue, Mar 24, 2015 at 1:38 PM, Bin Wang wbi...@gmail.com wrote: Hi, I'm learning Spark and I find there could
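
For reference, a minimal sketch of tuning that interval; the interval value and batch duration below are assumed examples, and in Spark 1.x the setting is interpreted as milliseconds.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Shrink the block interval so each block is buffered for a shorter time
    val conf = new SparkConf()
      .setAppName("block-interval-example")
      .set("spark.streaming.blockInterval", "100")    // default is 200 (ms) in 1.x
    val ssc = new StreamingContext(conf, Seconds(1))  // assumed 1-second batches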

Re: hadoop input/output format advanced control

2015-03-24 Thread Koert Kuipers
i would like to use objectFile with some tweaks to the hadoop conf. currently there is no way to do that, except recreating objectFile myself, and some of the code objectFile uses i have no access to, since it's private to spark. On Tue, Mar 24, 2015 at 2:59 PM, Patrick Wendell pwend...@gmail.com
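
A hypothetical sketch of the workaround being described: rebuilding objectFile by hand so a caller-supplied Hadoop Configuration can be passed in. The real objectFile uses Spark-private helpers (Utils.deserialize with Spark's class loader), so plain Java serialization stands in for them here; treat this as an illustration, not a drop-in replacement.

    import java.io.{ByteArrayInputStream, ObjectInputStream}
    import scala.reflect.ClassTag
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{BytesWritable, NullWritable}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // saveAsObjectFile writes sequence files of (NullWritable, BytesWritable) records,
    // each value holding a Java-serialized Array[T] chunk, so read them back the same way.
    def objectFileWithConf[T: ClassTag](sc: SparkContext, path: String, conf: Configuration): RDD[T] =
      sc.newAPIHadoopFile(path,
          classOf[SequenceFileInputFormat[NullWritable, BytesWritable]],
          classOf[NullWritable], classOf[BytesWritable], conf)
        .flatMap { case (_, bytes) =>
          val in = new ObjectInputStream(
            new ByteArrayInputStream(bytes.getBytes, 0, bytes.getLength))
          try in.readObject().asInstanceOf[Array[T]].iterator finally in.close()
        }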

Re: Experience using binary packages on various Hadoop distros

2015-03-24 Thread Matei Zaharia
Just a note, one challenge with the BYOH version might be that users who download it can't run in local mode without also having Hadoop. But if we describe it correctly then hopefully it's okay. Matei On Mar 24, 2015, at 3:05 PM, Patrick Wendell pwend...@gmail.com wrote: Hey All, For

Re: hadoop input/output format advanced control

2015-03-24 Thread Imran Rashid
I think this would be a great addition; I totally agree that you need to be able to set these at a finer context than just the SparkContext. Just to play devil's advocate, though -- the alternative is for you to just subclass HadoopRDD yourself, or make a totally new RDD, and then you could expose

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
That's correct. What's the reason this information is needed? -Sandy On Tue, Mar 24, 2015 at 11:41 AM, Zoltán Zvara zoltan.zv...@gmail.com wrote: Thank you for your response! I guess the (Spark) AM, which gives the container lease to the NM (along with the executor JAR and command to run)

Spark SQL(1.3.0) import sqlContext.implicits._ seems not work for converting a case class RDD to DataFrame

2015-03-24 Thread Zhiwei Chan
Hi all, I just upgraded Spark from 1.2.1 to 1.3.0, and changed the import sqlContext.createSchemaRDD to import sqlContext.implicits._ in my code. (I scanned the programming guide and it seems this is the only change I need to make.) But it fails to compile with the following error: [ERROR]
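
For context, a minimal sketch of the 1.3 idiom under discussion (class, table and field names are made up): the case class has to live outside the method that converts the RDD, and createSchemaRDD is replaced by importing sqlContext.implicits._ and calling .toDF().

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)  // must be top-level, not nested in the method

    object ImplicitsExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("implicits-example"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._  // replaces the 1.2 import sqlContext.createSchemaRDD
        val df = sc.parallelize(Seq(Record(1, "a"), Record(2, "b"))).toDF()
        df.registerTempTable("records")
        sqlContext.sql("SELECT key, value FROM records").show()
      }
    }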

Re: Understanding shuffle file name conflicts

2015-03-24 Thread Saisai Shao
Hi Kannan, As far as I know, the shuffle ID in ShuffleDependency is incremented, so even if you run the same job twice, the shuffle dependency and hence the shuffle ID will be different. The shuffle file name, which is composed of (shuffleId+mapId+reduceId), will therefore change, so there's no name conflict even
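
To make the uniqueness argument concrete, a small sketch that mirrors the naming scheme described above (the format used by Spark's ShuffleBlockId); the IDs below are assumed example values.

    // shuffleId changes between runs of "the same" job, so names never collide
    def shuffleFileName(shuffleId: Int, mapId: Int, reduceId: Int): String =
      s"shuffle_${shuffleId}_${mapId}_${reduceId}"

    shuffleFileName(0, 3, 1)  // "shuffle_0_3_1" -- first run
    shuffleFileName(1, 3, 1)  // "shuffle_1_3_1" -- second run gets a new shuffleId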

Re: Any guidance on when to back port and how far?

2015-03-24 Thread Michael Armbrust
Two other criteria that I use when deciding what to backport: - Is it a regression from a previous minor release? I'm much more likely to backport fixes in this case, as I'd love for most people to stay up to date. - How scary is the change? I think the primary goal is stability of the

Re: hadoop input/output format advanced control

2015-03-24 Thread Nick Pentreath
Imran, on your point about reading multiple files together in a partition, is it not simpler to copy the Hadoop conf and set per-RDD settings for min split size to control the input size per partition, together with something like CombineFileInputFormat? On Tue, Mar 24, 2015 at 5:28 PM,

Re: Any guidance on when to back port and how far?

2015-03-24 Thread Patrick Wendell
My philosophy has been basically what you suggested, Sean. One thing you didn't mention, though, is that if a bug fix seems complicated, I will think very hard before back-porting it. This is because fixes can introduce their own new bugs, in some cases worse than the original issue. It's really bad to

Re: hadoop input/output format advanced control

2015-03-24 Thread Patrick Wendell
Yeah - to Nick's point, I think the way to do this is to pass in a custom conf when you create a Hadoop RDD (that's AFAIK why the conf field is there). Is there anything you can't do with that feature? On Tue, Mar 24, 2015 at 11:50 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Imran, on
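
A sketch of that approach, roughly combining Patrick's and Nick's suggestions: copy the driver's Hadoop conf, override a per-RDD setting (the min split size here is just an assumed example), and pass it when creating the Hadoop RDD. The path and values are placeholders.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def readWithCustomConf(sc: SparkContext, path: String): RDD[String] = {
      val conf = new Configuration(sc.hadoopConfiguration)  // per-RDD copy, driver conf untouched
      conf.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
      sc.newAPIHadoopFile(path, classOf[TextInputFormat],
          classOf[LongWritable], classOf[Text], conf)
        .map(_._2.toString)
    }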

Experience using binary packages on various Hadoop distros

2015-03-24 Thread Patrick Wendell
Hey All, For a while we've published binary packages with different Hadoop clients pre-bundled. We currently have three interfaces to a Hadoop cluster: (a) the HDFS client, (b) the YARN client, (c) the Hive client. Because (a) and (b) are supposed to be backwards-compatible interfaces. My working

Re: Spark-thriftserver Issue

2015-03-24 Thread Zhan Zhang
You can try to set it in spark-env.sh. # - SPARK_LOG_DIR Where log files are stored. (Default: ${SPARK_HOME}/logs) # - SPARK_PID_DIR Where the pid file is stored. (Default: /tmp) Thanks. Zhan Zhang On Mar 24, 2015, at 12:10 PM, Anubhav Agarwal

Re: Spark Executor resources

2015-03-24 Thread Sandy Ryza
Hi Zoltan, If running on YARN, the YARN NodeManager starts executors. I don't think there's a 100% precise way for the Spark executor to know how many resources are allotted to it. It can come close by looking at the Spark configuration options used to request it (spark.executor.memory and
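
A rough sketch of the approximation Sandy describes: on the executor, read back the configuration the executor was requested with (this is not a precise view of the YARN container; the default values below are assumed).

    import org.apache.spark.SparkEnv
    import org.apache.spark.rdd.RDD

    def logRequestedResources(rdd: RDD[_]): Unit =
      rdd.foreachPartition { _ =>
        val conf  = SparkEnv.get.conf                        // executor-side SparkConf
        val mem   = conf.get("spark.executor.memory", "1g")  // what was requested
        val cores = conf.getInt("spark.executor.cores", 1)
        val heap  = Runtime.getRuntime.maxMemory             // what this JVM actually got
        println(s"requested $mem / $cores cores; JVM max heap = $heap bytes")
      }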

Understanding shuffle file name conflicts

2015-03-24 Thread Kannan Rajah
I am working on SPARK-1529. I ran into an issue with my change, where the same shuffle file was being reused across 2 jobs. Please note this only happens when I use a hard-coded location for shuffle files, say /tmp. It does not happen with the normal code path that uses DiskBlockManager to pick

Optimize the first map reduce of DStream

2015-03-24 Thread Bin Wang
Hi, I'm learning Spark and I find there could be some optimization of the current streaming implementation. Correct me if I'm wrong. The current streaming implementation puts the data of one batch into memory (as an RDD). But it seems unnecessary. For example, if I want to count the lines which
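
A generic version (in Scala) of the kind of per-batch count the question describes; the source, host, port and filter predicate are made-up placeholders. Today each batch is materialized as an RDD before the count runs, which is the behaviour the question asks about optimizing.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf  = new SparkConf().setAppName("first-map-reduce")
    val ssc   = new StreamingContext(conf, Seconds(2))    // assumed 2-second batches
    val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
    lines.filter(_.contains("ERROR")).count().print()     // first map/reduce over each batch
    ssc.start()
    ssc.awaitTermination()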