Re: Recommendation for a good book on Spark, beginner to moderate knowledge

2016-02-28 Thread Nicos
I agree with the suggestion to start with "Learning Spark" to further build your 
knowledge of Spark fundamentals.

"Advanced Analytics with Spark" has good practical reinforcement of what you 
learn from the previous book. Though it is a bit advanced, in my opinion some 
practical/real applications are better covered in this book.

For DataFrames and other newer APIs, the online Apache Spark documentation is still 
the best source.

Keep in mind that Spark and its various subsystems are constantly evolving. 
Publications will always be somewhat outdated, but the key fundamental concepts 
will not.
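
If it helps as a taste of those fundamentals, here is a minimal Scala sketch (illustrative only, assuming a local Spark 1.x-style setup; the object name is made up) contrasting lazy transformations with eager actions, plus a rough DataFrame equivalent:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TransformVsActionSketch {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("TransformVsActionSketch").setMaster("local[*]"))
    val sql = new SQLContext(sc)
    import sql.implicits._

    // Transformations (map, filter) are lazy: they only describe the computation.
    val evenSquares = sc.parallelize(1 to 100).map(n => n * n).filter(_ % 2 == 0)

    // Actions (count, take) actually run the job and return results to the driver.
    println(evenSquares.count())
    println(evenSquares.take(5).mkString(", "))

    // Rough DataFrame equivalent; the online docs are the best reference for this API.
    val df = sc.parallelize(1 to 100).toDF("n")
    df.selectExpr("n * n AS sq").where("sq % 2 = 0").show(5)

    sc.stop()
  }
}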

Cheers,
- Nicos
+++ 


> On Feb 28, 2016, at 1:53 PM, Michał Zieliński  
> wrote:
> 
> Most of the books are outdated (don't include DataFrames or Spark ML and 
> focus on RDDs and MLlib). The one I particularly liked is "Learning Spark". 
> It starts from the basics, but has lots of useful tips on caching, 
> serialization etc.
> 
> The online docs are also of great quality.
> 
>> On 28 February 2016 at 21:48, Ashok Kumar  
>> wrote:
>>   Hi Gurus,
>> 
>> I would appreciate it if you could recommend a good book on Spark, or 
>> documentation, suitable for beginner to moderate knowledge.
>> 
>> I would very much like to skill myself up on transformation and action methods.
>> 
>> FYI, I have already looked at examples on the net. However, some of them are not 
>> clear, at least to me.
>> 
>> Warmest regards
> 


Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Nicos
Folks,
Just a gentle reminder of what we owe to ourselves:
- this is a public forum and we need to behave accordingly; it is not a place to 
vent frustration in a rude way
- getting attention here is an earned privilege, not an entitlement
- this is not a “Platinum Support” department of your vendor, but rather an open 
source collaboration forum where people volunteer their time to pay attention 
to your needs
- there are still many gray areas, so be patient and articulate questions in as 
much detail as possible if you want to get quick help and not just be 
perceived as a smart a$$

FYI - Paco Nathan is a well-respected Spark evangelist, and many people, 
including myself, owe their leap onto the promise of the Spark platform to his passion. 
People like Sean Owen keep us believing in things when we feel like we are hitting a 
dead end.

Please be respectful of the connections you are privileged with and act civilly.

Have a great day!
- Nicos


> On Jan 22, 2015, at 7:49 AM, Sean Owen  wrote:
> 
> Yes, this isn't a well-formed question, and got maybe the response it
> deserved, but the tone is veering off the rails. I just got a much
> ruder reply from Sudipta privately, which I will not forward. Sudipta,
> I suggest you take the responses you've gotten so far as about as much
> answer as can be had here and do some work yourself, and come back
> with much more specific questions, and it will all be helpful and
> polite again.
> 
> On Thu, Jan 22, 2015 at 2:51 PM, Sudipta Banerjee
>  wrote:
>> Hi Marco,
>> 
>> Thanks for the confirmation. Please let me know what more detail you need
>> to answer a very specific question: WHAT IS THE MINIMUM HARDWARE
>> CONFIGURATION REQUIRED TO BUILD HDFS + MAPREDUCE + SPARK + YARN on a system?
>> Please let me know if you need any further information, and if you don't know,
>> please drive across with the $1 to Sir Paco Nathan and get me the
>> answer.
>> 
>> Thanks and Regards,
>> Sudipta
>> 
>> On Thu, Jan 22, 2015 at 5:33 PM, Marco Shaw  wrote:
>>> 
>>> Hi,
>>> 
>>> Let me reword your request so you understand how (too) generic your
>>> question is:
>>> 
>>> "Hi, I have $10,000, please find me some means of transportation so I can
>>> get to work."
>>> 
>>> Please provide (a lot) more details. If you can't, consider using one of
>>> the pre-built express VMs from either Cloudera, Hortonworks or MapR, for
>>> example.
>>> 
>>> Marco
>>> 
>>> 
>>> 
>>>> On Jan 22, 2015, at 7:36 AM, Sudipta Banerjee
>>>>  wrote:
>>>> 
>>>> 
>>>> 
>>>> Hi Apache-Spark team ,
>>>> 
>>>> What are the system requirements for installing Hadoop and Apache Spark?
>>>> I have attached a screenshot of GParted.
>>>> 
>>>> 
>>>> Thanks and regards,
>>>> Sudipta
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sudipta Banerjee
>>>> Consultant, Business Analytics and Cloud Based Architecture
>>>> Call me +919019578099
>>>> 
>>>> 
>> 
>> 
>> 
>> 
>> --
>> Sudipta Banerjee
>> Consultant, Business Analytics and Cloud Based Architecture
>> Call me +919019578099
> 





Re: Some tasks are taking long time

2015-01-15 Thread Nicos
Ajay,
Unless we are dealing with some synchronization/condition-variable 
bug in Spark, try this, per the tuning guide:
Cache Size Tuning

One important configuration parameter for GC is the amount of memory that 
should be used for caching RDDs. By default, Spark uses 60% of the configured 
executor memory (spark.executor.memory) to cache RDDs. This means that 40% of 
memory is available for any objects created during task execution.

In case your tasks slow down and you find that your JVM is garbage-collecting 
frequently or running out of memory, lowering this value will help reduce the 
memory consumption. To change this to, say, 50%, you can call 
conf.set("spark.storage.memoryFraction", "0.5") on your SparkConf. Combined 
with the use of serialized caching, using a smaller cache should be sufficient 
to mitigate most of the garbage collection problems. In case you are interested 
in further tuning the Java GC, continue reading below.


Complete list of tips here:
https://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage
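
For concreteness, a minimal Scala sketch of those two knobs (the object name and input path below are illustrative, not from the original job):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheTuningSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("CacheTuningSketch")
      // Shrink the RDD cache from the default 60% to 50% of executor memory,
      // leaving more headroom for task-execution objects and easing GC pressure.
      .set("spark.storage.memoryFraction", "0.5")
      // Kryo usually serializes cached/shuffled data more compactly than Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    // Persist in serialized form so the smaller cache still fits the data,
    // at the cost of deserializing records on access.
    val lines = sc.textFile("hdfs:///path/to/input")   // hypothetical path
      .persist(StorageLevel.MEMORY_ONLY_SER)

    println(lines.count())
    sc.stop()
  }
}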

Cheers,
- Nicos

> On Jan 15, 2015, at 6:49 AM, Ajay Srivastava 
>  wrote:
> 
> Thanks RK. I can turn on speculative execution, but I am trying to find out the 
> actual reason for the delay, as it happens on any node. Any idea about the stack 
> trace in my previous mail?
> 
> Regards,
> Ajay
> 
> 
> On Thursday, January 15, 2015 8:02 PM, RK  wrote:
> 
> 
> If you don't want a few slow tasks to slow down the entire job, you can turn 
> on speculation. 
> 
> Here are the speculation settings from Spark Configuration - Spark 1.2.0 
> Documentation <http://spark.apache.org/docs/1.2.0/configuration.html>.
> 
> spark.speculation            false   If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.
> spark.speculation.interval   100     How often Spark will check for tasks to speculate, in milliseconds.
> spark.speculation.quantile   0.75    Percentage of tasks which must be complete before speculation is enabled for a particular stage.
> spark.speculation.multiplier 1.5     How many times slower a task is than the median to be considered for speculation.
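
For concreteness, a minimal Scala sketch of enabling speculation with these settings (the object name is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SpeculationSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SpeculationSketch")
      .set("spark.speculation", "true")            // off by default
      .set("spark.speculation.interval", "100")    // check for stragglers every 100 ms
      .set("spark.speculation.quantile", "0.75")   // wait until 75% of a stage's tasks finish
      .set("spark.speculation.multiplier", "1.5")  // re-launch tasks 1.5x slower than the median

    val sc = new SparkContext(conf)
    // ... run the job as usual; straggler tasks in each stage may now be re-executed speculatively.
    sc.stop()
  }
}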
> 
>  
> 
> 
> On Thursday, January 15, 2015 5:44 AM, Ajay Srivastava 
>  wrote:
> 
> 
> Hi,
> 
> My Spark job is taking a long time. I see that some tasks are taking longer for 
> the same amount of data and shuffle read/write. What could be the possible 
> reasons for this?
> 
> The thread dump sometimes shows that all the tasks in an executor are waiting 
> with the following stack trace -
> 
> "Executor task launch worker-12" daemon prio=10 tid=0x7fcd44276000 
> nid=0x3f85 waiting on condition [0x7fcce3ddc000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7fd0aee82e00> (a 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at java.util.concurrent.locks.LockSupport.park(Unknown Source)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown
>  Source)
> at java.util.concurrent.LinkedBlockingQueue.take(Unknown Source)
> at 
> org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.next(BlockFetcherIterator.scala:253)
> at 
> org.apache.spark.storage.BlockFetcherIterator$BasicBlockFetcherIterator.next(BlockFetcherIterator.scala:77)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
> at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:137)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> at 
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.appl