Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Makes sense. "/" is where /tmp would be. However, 230G should be plenty of
space. If you have INFO logging turned on (set in
$SPARK_HOME/conf/log4j.properties), you'll see messages about saving data
to disk that will list sizes. The web console also has some summary
information about this.
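
If it helps, turning on INFO logging is normally a one-line change in that
file (a minimal example based on the stock log4j.properties.template):

# Log everything at INFO level to the console
log4j.rootCategory=INFO, console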

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Apr 29, 2015 at 6:25 AM, selim namsi  wrote:

> This is the output of df -h so as you can see I'm using only one disk
> mounted on /
>
> df -h
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sda8   276G   34G  229G  13% /
> none        4.0K     0  4.0K   0% /sys/fs/cgroup
> udev        7.8G  4.0K  7.8G   1% /dev
> tmpfs       1.6G  1.4M  1.6G   1% /run
> none        5.0M     0  5.0M   0% /run/lock
> none        7.8G   37M  7.8G   1% /run/shm
> none        100M   40K  100M   1% /run/user
> /dev/sda1   496M   55M  442M  11% /boot/efi
>
> Also when running the program, I noticed that the Used% disk space related
> to the partition mounted on "/" was growing very fast
>
> On Wed, Apr 29, 2015 at 12:19 PM Anshul Singhle 
> wrote:
>
>> Do you have multiple disks? Maybe your work directory is not in the right
>> disk?
>>
>> On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi 
>> wrote:
>>
>>> Hi,
>>>
>>> I'm using Spark (1.3.1) MLlib to run the random forest algorithm on TF-IDF
>>> output; the training data is a file containing 156060 (size 8.1M).
>>>
>>> The problem is that when trying to persist a partition into memory and
>>> there is not enough memory, the partition is persisted on disk, and despite
>>> having 229G of free disk space, I got "No space left on device".
>>>
>>> This is how I'm running the program:
>>>
>>> ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
>>> local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
>>> testData.tsv
>>>
>>> And this is a part of the log:
>>>
>>>
>>>
>>> If you need more information, please let me know.
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>


Re: java.io.IOException: No space left on device

2015-04-29 Thread Dean Wampler
Or multiple volumes. The LOCAL_DIRS (YARN) and SPARK_LOCAL_DIRS (Mesos,
Standalone) environment variables and the spark.local.dir property control
where temporary data is written. The default is /tmp.

See
http://spark.apache.org/docs/latest/configuration.html#runtime-environment
for more details.
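
For example, a minimal sketch of pointing the scratch space at a larger volume
(the path here is made up; a comma-separated list of directories also works):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyJob")
  // Scratch space for shuffle files and blocks spilled to disk.
  // Hypothetical path; pick a volume with plenty of free space.
  .set("spark.local.dir", "/mnt/bigdisk/spark-tmp")
val sc = new SparkContext(conf)

Equivalently, you can export SPARK_LOCAL_DIRS before launching; when set by
the cluster manager, the environment variable takes precedence over the
property.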

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Apr 29, 2015 at 6:19 AM, Anshul Singhle 
wrote:

> Do you have multiple disks? Maybe your work directory is not in the right
> disk?
>
> On Wed, Apr 29, 2015 at 4:43 PM, Selim Namsi 
> wrote:
>
>> Hi,
>>
>> I'm using Spark (1.3.1) MLlib to run the random forest algorithm on TF-IDF
>> output; the training data is a file containing 156060 (size 8.1M).
>>
>> The problem is that when trying to persist a partition into memory and
>> there is not enough memory, the partition is persisted on disk, and despite
>> having 229G of free disk space, I got "No space left on device".
>>
>> This is how I'm running the program:
>>
>> ./spark-submit --class com.custom.sentimentAnalysis.MainPipeline --master
>> local[2] --driver-memory 5g ml_pipeline.jar labeledTrainData.tsv
>> testData.tsv
>>
>> And this is a part of the log:
>>
>>
>>
>> If you need more information, please let me know.
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/java-io-IOException-No-space-left-on-device-tp22702.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Questions about Accumulators

2015-05-03 Thread Dean Wampler
Yes, correct.

However, note that when an accumulator operation is *idempotent*, meaning
that repeated application for the same data behaves exactly like one
application, then that accumulator can be safely called in transformation
steps (non-actions), too.

For example, max and min tracking. Just last week I wrote one that used a
hash map to track the latest timestamps seen for specific keys.
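
For illustration, a minimal sketch of a max-tracking accumulator for Spark 1.x
(idempotent because folding in the same value twice cannot change the max; the
names here are made up, not the code I mentioned):

import org.apache.spark.{AccumulatorParam, SparkContext}
import org.apache.spark.rdd.RDD

// Max is idempotent: re-applying an already-seen value leaves the result unchanged.
object MaxParam extends AccumulatorParam[Long] {
  def zero(initialValue: Long): Long = Long.MinValue
  def addInPlace(r1: Long, r2: Long): Long = math.max(r1, r2)
}

def maxTimestamp(sc: SparkContext, timestamps: RDD[Long]): Long = {
  val maxAcc = sc.accumulator(Long.MinValue)(MaxParam)
  // Even if a task is re-executed and adds the same values again, the max is unaffected.
  timestamps.foreach(ts => maxAcc += ts)
  maxAcc.value
}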

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 8:07 AM, xiazhuchang  wrote:

> “For accumulator updates performed inside actions only, Spark guarantees
> that
> each task’s update to the accumulator will only be applied once, i.e.
> restarted tasks will not update the value. In transformations, users should
> be aware of that each task’s update may be applied more than once if tasks
> or job stages are re-executed. ”
> Does this mean the guarantee (the accumulator is only updated once) holds
> only in actions? That is to say, one should use the accumulator only in
> actions, or else there may be some errors (updates applied more than once)
> if it is used in transformations?
> e.g. map(x => accumulator += x)
> After execution, the correct result of the accumulator should be "1";
> unfortunately, if some error happens and the task is restarted, the map()
> operation is re-executed (map(x => accumulator += x) re-executed), then the
> final result of the accumulator will be "2", twice the correct result?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Questions-about-Accumulators-tp22746p22747.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: PriviledgedActionException- Executor error

2015-05-03 Thread Dean Wampler
So, it looks like Akka's remote call from your master node, where the
CoarseGrainedExecutorBackend is running, to one or more slave nodes is
timing out.

By default, the port is 2552. Do you have a firewall between the nodes in
your cluster that might be blocking this port? If you're not sure, try
logging into your master and running the command "telnet slave_addr 2552"
where "slave_addr" is one of the slave's IP addresses or a routable host
name.

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 4:25 AM, podioss  wrote:

> Hi,
> I am running several jobs in standalone mode and I notice this error in the
> log files on some of my nodes at the start of my jobs:
>
> INFO executor.CoarseGrainedExecutorBackend: Registered signal handlers for
> [TERM, HUP, INT]
> INFO spark.SecurityManager: Changing view acls to: root
> INFO spark.SecurityManager: Changing modify acls to: root
> INFO spark.SecurityManager: SecurityManager: authentication disabled; ui
> acls disabled; users with view permissions: NFO slf4j.Slf4jLogger:
> Slf4jLogger started
> INFO Remoting: Starting remoting
> ERROR security.UserGroupInformation: PriviledgedActionException as:root
> cause:java.util.concurrent.TimeoutException: Futures timed out after [1
> milliseconds]
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException:
> Unknown exception in doAs
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1134)
> at
>
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:59)
> at
>
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:115)
> at
>
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:163)
> at
>
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.security.PrivilegedActionException:
> java.util.concurrent.TimeoutException: Futures timed out after [1
> milliseconds]
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
> ... 4 more
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [1 milliseconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
>
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at akka.remote.Remoting.start(Remoting.scala:180)
> at
> akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
> at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
> at
> akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
> at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
> at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
> at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
> at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
> at
>
> org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
> at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
> at
> org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
> at
>
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1676)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> at
> org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1667)
> at
> org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
> at
>
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:122)
> at
>
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:60)
> at
>
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:59)
> ... 7 more
> INFO remote.RemoteActorRefProvider$Re

Re: Spark distributed SQL: JSON Data set on all worker node

2015-05-03 Thread Dean Wampler
Note that each JSON object has to be on a single line in the files.
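
For reference, a minimal sketch of the jsonFile approach described below
(Spark 1.3 API, given an existing SparkContext sc; the path and query are made
up):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Each line of the input must be a complete, self-contained JSON object.
val events = sqlContext.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events").show()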

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 4:14 AM, ayan guha  wrote:

> Yes, it is possible. You need to use the jsonFile method on the SQL context
> and then create a DataFrame from the RDD. Then register it as a table. It
> should be 3 lines of code, thanks to Spark.
>
> You may see a few YouTube videos, esp. on unifying pipelines.
> On 3 May 2015 19:02, "Jai"  wrote:
>
>> Hi,
>>
>> I am a noob to Spark and related technologies.
>>
>> I have JSON stored at the same location on all worker clients (Spark cluster).
>> I am looking to load this JSON data set on those clients and run SQL
>> queries, like distributed SQL.
>>
>> Is it possible to achieve this?
>>
>> Right now, the master submits the task to one node only.
>>
>> Thanks and regards
>> Mrityunjay
>>
>


Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
How big is the data you're returning to the driver with collectAsMap? You
are probably running out of memory trying to copy too much data back to it.

If you're trying to force a map-side join, Spark can do that for you in
some cases within the regular DataFrame/RDD context. See
http://spark.apache.org/docs/latest/sql-programming-guide.html#performance-tuning
and this talk by Michael Armbrust for example,
http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf.
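
For reference, a minimal sketch of a hand-rolled broadcast ("map-side") join,
for when one side really is small enough to collect (the key and value types
here are made up):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def broadcastJoin(sc: SparkContext,
                  large: RDD[(Long, String)],
                  small: RDD[(Long, String)]): RDD[(Long, (String, String))] = {
  // Small side: collect to the driver, ship one read-only copy to each executor.
  val smallBc = sc.broadcast(small.collectAsMap())
  // Large side: local hash lookups, so the large RDD is never shuffled.
  large.flatMap { case (k, v) => smallBc.value.get(k).map(w => (k, (v, w))) }
}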


dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Thu, Apr 30, 2015 at 12:40 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Full Exception
> 15/04/30 09:59:49 INFO scheduler.DAGScheduler: Stage 1 (collectAsMap at
> VISummaryDataProvider.scala:37) failed in 884.087 s
> 15/04/30 09:59:49 INFO scheduler.DAGScheduler: Job 0 failed: collectAsMap
> at VISummaryDataProvider.scala:37, took 1093.418249 s
> 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
> exception: Job aborted due to stage failure: Exception while getting task
> result: org.apache.spark.SparkException: Error sending message [message =
> GetLocations(taskresult_112)]
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result: org.apache.spark.SparkException: Error
> sending message [message = GetLocations(taskresult_112)]
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/04/30 09:59:49 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception: Job aborted due to stage
> failure: Exception while getting task result:
> org.apache.spark.SparkException: Error sending message [message =
> GetLocations(taskresult_112)])
>
>
> Code at line 37:
>
> val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }.collectAsMap
>
> Listing data set size is 26G (10 files) and my driver memory is 12G (I
> can't go beyond it). The reason I do collectAsMap is to broadcast it and do a
> map-side join instead of a regular join.
>
>
> Please suggest?
>
>
> On Thu, Apr 30, 2015 at 10:52 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
> wrote:
>
>> My Spark Job is failing  and i see
>>
>> ==
>>
>> 15/04/30 09:59:49 ERROR yarn.ApplicationMaster: User class threw
>> exception: Job aborted due to stage failure: Exception while getting task
>> result: org.apache.spark.SparkException: Error sending message [message =
>> GetLocations(taskresult_112)]
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure:
>> Exception while getting task result: org.apache.spark.SparkException: Error
>> sending message [message = GetLocations(taskresult_112)]
>>
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
>>
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
>>
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
>>
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>
>> at
>> org.apache.spark.scheduler.DAGSchedul

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
IMHO, you are trying waaay too hard to optimize work on what is really a
small data set. 25G, even 250G, is not that much data, especially if you've
spent a month trying to get something to work that should be simple. All
these errors are from optimization attempts.

Kryo is great, but if it's not working reliably for some reason, then don't
use it. Rather than force 200 partitions, let Spark try to figure out a
good-enough number. (If you really need to force a partition count, use the
repartition method instead, unless you're overriding the partitioner.)

So. I recommend that you eliminate all the optimizations: Kryo,
partitionBy, etc. Just use the simplest code you can. Make it work first.
Then, if it really isn't fast enough, look for actual evidence of
bottlenecks and optimize those.
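
Concretely, a stripped-down sketch using the RDD names from your earlier
message (no Kryo registration, no explicit partitioner):

val lstgPairs = listings.map(lstg => (lstg.getItemId().toLong, lstg))
val viEvents  = details.map(vi => (vi.get(14).asInstanceOf[Long], vi))
// Plain equi-join; let Spark choose the join strategy and partition count.
val joined = viEvents.join(lstgPairs)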



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Hello Dean & Others,
> Thanks for your suggestions.
> I have two data sets and all I want to do is a simple equi-join. I have a
> 10G limit, and as my dataset_1 exceeded that it was throwing an OOM error.
> Hence I switched back to the .join() API instead of a map-side broadcast
> join.
> I am repartitioning the data with 100, 200 and I see a NullPointerException
> now.
>
> When I run against 25G of each input and with .partitionBy(new
> org.apache.spark.HashPartitioner(200)), I see a NullPointerException.
>
>
> This trace does not include a line from my code, so I do not know what is
> causing the error.
> I do have a registered Kryo serializer.
>
> val conf = new SparkConf()
>   .setAppName(detail)
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
>   .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
>   .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
>   .set("spark.yarn.maxAppAttempts", "0")
>   .registerKryoClasses(Array(classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
> val sc = new SparkContext(conf)
>
> I see the exception when this task runs
>
> val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
>
> It's a simple mapping of input records to (itemId, record).
>
> I found this
>
> http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
> and
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
>
> Looks like Kryo (v2.21) changed something to stop using default
> constructors.
>
> ((Kryo.DefaultInstantiatorStrategy) kryo.getInstantiatorStrategy()).setFallbackInstantiatorStrategy(new StdInstantiatorStrategy());
>
>
> Please suggest
>
>
> Trace:
> 15/05/01 03:02:15 ERROR executor.Executor: Exception in task 110.1 in
> stage 2.0 (TID 774)
> com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
> Serialization trace:
> values (org.apache.avro.generic.GenericData$Record)
> datum (org.apache.avro.mapred.AvroKey)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
> at
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:41)
> at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> Regards,
>
>
> Any suggestions?
> I have not been able to get this thing to work for over a month now; it's
> getting frustrating.
>
> On Sun, May 3, 2015 at 8:03 PM, Dean Wampler 
> wrote:
>
>> How big is the data you're returning to the driver with collectAsMap? You
>> are probably running out of memory trying to copy too much data back to it.
>>
>> If you're trying to force a map-side join, Spark can do that for you in
>> some cases within the regular DataFrame/RDD context. Se

Re: Spark - Timeout Issues - OutOfMemoryError

2015-05-03 Thread Dean Wampler
I don't know the full context of what you're doing, but serialization
errors usually mean you're attempting to serialize something that can't be
serialized, like the SparkContext. Kryo won't help there.

The arguments to spark-submit you posted previously look good:

2)  --num-executors 96 --driver-memory 12g --driver-java-options
"-XX:MaxPermSize=10G" --executor-memory 12g --executor-cores 4

I suspect you aren't getting the parallelism you need. For partitioning, if
your data is in HDFS and your block size is 128MB, then you'll get ~195
partitions anyway. If it takes 7 hours to do a join over 25GB of data, you
have some other serious bottleneck. You should examine the web console and
the logs to determine where all the time is going. Questions you might
pursue:

   - How long does each task take to complete?
   - How many of those 195 partitions/tasks are processed at the same time?
   That is, how many "slots" are available?  Maybe you need more nodes if the
   number of slots is too low. Based on your command arguments, you should be
   able to process 1/2 of them at a time, unless the cluster is busy.
   - Is the cluster swamped with other work?
   - How much data does each task process? Is the data roughly the same
   from one task to the next? If not, then you might have serious key skew.

You may also need to research the details of how joins are implemented and
some of the common tricks for organizing data to minimize having to shuffle
all N by M records.
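
One of those tricks, sketched here: co-partition both sides with the same
partitioner up front and cache the reused side, so the join becomes a narrow
dependency instead of shuffling both datasets every time (the partition count
is illustrative only):

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

val part = new HashPartitioner(200)

// Pay the partitioning shuffle once and keep the result around.
val listingsByItem = listings.map(lstg => (lstg.getItemId().toLong, lstg))
  .partitionBy(part)
  .persist(StorageLevel.MEMORY_AND_DISK)

val eventsByItem = details.map(vi => (vi.get(14).asInstanceOf[Long], vi))
  .partitionBy(part)

// Both sides share the partitioner, so the join itself does not reshuffle them.
val joined = eventsByItem.join(listingsByItem)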



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, May 3, 2015 at 11:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Hello Dean,
> If I don't use the Kryo serializer I get a serialization error, hence I am
> using it.
> If I don't use partitionBy/repartition then the simple join never
> completed even after 7 hours, and in fact as a next step I need to run it
> against 250G as that is my full dataset size. Someone here suggested to me
> to use repartition.
>
> Assuming repartition is mandatory, how do I decide what's the right number?
> When I am using 400 I do not get the NullPointerException that I talked
> about, which is strange. I never saw that exception against a small random
> dataset, but I see it with 25G; and again with 400 partitions, I do not see it.
>
>
> On Sun, May 3, 2015 at 9:15 PM, Dean Wampler 
> wrote:
>
>> IMHO, you are trying waaay too hard to optimize work on what is really a
>> small data set. 25G, even 250G, is not that much data, especially if you've
>> spent a month trying to get something to work that should be simple. All
>> these errors are from optimization attempts.
>>
>> Kryo is great, but if it's not working reliably for some reason, then
>> don't use it. Rather than force 200 partitions, let Spark try to figure out
>> a good-enough number. (If you really need to force a partition count, use
>> the repartition method instead, unless you're overriding the partitioner.)
>>
>> So. I recommend that you eliminate all the optimizations: Kryo,
>> partitionBy, etc. Just use the simplest code you can. Make it work first.
>> Then, if it really isn't fast enough, look for actual evidence of
>> bottlenecks and optimize those.
>>
>>
>>
>> Dean Wampler, Ph.D.
>> Author: Programming Scala, 2nd Edition
>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>> Typesafe <http://typesafe.com>
>> @deanwampler <http://twitter.com/deanwampler>
>> http://polyglotprogramming.com
>>
>> On Sun, May 3, 2015 at 10:22 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) 
>> wrote:
>>
>>> Hello Dean & Others,
>>> Thanks for your suggestions.
>>> I have two data sets and all i want to do is a simple equi join. I have
>>> 10G limit and as my dataset_1 exceeded that it was throwing OOM error.
>>> Hence i switched back to use .join() API instead of map-side broadcast
>>> join.
>>> I am repartitioning the data with 100,200 and i see a
>>> NullPointerException now.
>>>
>>> When i run against 25G of each input and with .partitionBy(new
>>> org.apache.spark.HashPartitioner(200)) , I see NullPointerExveption
>>>
>>>
>>> this trace does not include a line from my code and hence i do not what
>>> is causing error ?
>>> I do have registered kryo serializer.
>>>
>>> val conf = new SparkConf()
>>>   .setAppName(detail)
>>> *  .set("spark.serializer",
>>> "org.apache.spark

Re: value toDF is not a member of RDD object

2015-05-12 Thread Dean Wampler
It's the import statement Olivier showed that makes the method available.

Note that you can also use `sqlContext.createDataFrame(myRDD)`, without the
need for the import statement. I personally prefer this approach.
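
For reference, a minimal sketch of both styles (Spark 1.3, given an existing
SparkContext sc; Person and peopleRdd are made-up names):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
val peopleRdd = sc.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

// Style 1: the implicits import brings toDF() into scope.
import sqlContext.implicits._
val df1 = peopleRdd.toDF()

// Style 2: explicit call on the SQLContext; no implicits import needed.
val df2 = sqlContext.createDataFrame(peopleRdd)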

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Tue, May 12, 2015 at 9:33 AM, Olivier Girardot  wrote:

> you need to instantiate a SQLContext :
> val sc : SparkContext = ...
> val sqlContext = new SQLContext(sc)
> import sqlContext.implicits._
>
> Le mar. 12 mai 2015 à 12:29, SLiZn Liu  a écrit :
>
>> I added `libraryDependencies += "org.apache.spark" % "spark-sql_2.11" %
>> "1.3.1"` to `build.sbt` but the error remains. Do I need to import modules
>> other than `import org.apache.spark.sql.{ Row, SQLContext }`?
>>
>> On Tue, May 12, 2015 at 5:56 PM Olivier Girardot 
>> wrote:
>>
>>> toDF is part of spark SQL so you need Spark SQL dependency + import
>>> sqlContext.implicits._ to get the toDF method.
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>> Le mar. 12 mai 2015 à 11:36, SLiZn Liu  a
>>> écrit :
>>>
>>>> Hi User Group,
>>>>
>>>> I’m trying to reproduce the example on Spark SQL Programming Guide
>>>> <https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection>,
>>>> and got a compile error when packaging with sbt:
>>>>
>>>> [error] myfile.scala:30: value toDF is not a member of 
>>>> org.apache.spark.rdd.RDD[Person]
>>>> [error] val people = 
>>>> sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p
>>>>  => Person(p(0), p(1).trim.toInt)).toDF()
>>>> [error]
>>>>   ^
>>>> [error] one error found
>>>> [error] (compile:compileIncremental) Compilation failed
>>>> [error] Total time: 3 s, completed May 12, 2015 4:11:53 PM
>>>>
>>>> I double checked my code includes import sqlContext.implicits._ after
>>>> reading this post
>>>> <https://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/%3c1426522113299-22083.p...@n3.nabble.com%3E>
>>>> on spark mailing list, even tried to use toDF("col1", "col2")
>>>> suggested by Xiangrui Meng in that post and got the same error.
>>>>
>>>> The Spark version is specified in build.sbt file as follows:
>>>>
>>>> scalaVersion := "2.11.6"
>>>> libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.3.1" % 
>>>> "provided"
>>>> libraryDependencies += "org.apache.spark" % "spark-mllib_2.11" % "1.3.1"
>>>>
>>>> Anyone have ideas the cause of this error?
>>>>
>>>> REGARDS,
>>>> Todd Leo
>>>> ​
>>>>
>>>


Re: Spark SQL: preferred syntax for column reference?

2015-05-13 Thread Dean Wampler
Is the $"foo" or mydf("foo") or both checked at compile time to verify that
the column reference is valid? Thx.

Dean

On Wednesday, May 13, 2015, Michael Armbrust  wrote:

> I would not say that either method is preferred (neither is
> old/deprecated).  One advantage to the second is that you are referencing a
> column from a specific dataframe, instead of just providing a string that
> will be resolved much like an identifier in a SQL query.
>
> This means given:
> df1 = [id: int, name: string ]
> df2 = [id: int, zip: int]
>
> I can do something like:
>
> df1.join(df2, df1("id") === df2("id"))
>
> Where as I would need aliases if I was only using strings:
>
> df1.as("a").join(df2.as("b"), $"a.id" === $"b.id")
>
> On Wed, May 13, 2015 at 9:55 AM, Diana Carroll  > wrote:
>
>> I'm just getting started with Spark SQL and DataFrames in 1.3.0.
>>
>> I notice that the Spark API shows a different syntax for referencing
>> columns in a dataframe than the Spark SQL Programming Guide.
>>
>> For instance, the API docs for the select method show this:
>> df.select($"colA", $"colB")
>>
>>
>> Whereas the programming guide shows this:
>> df.filter(df("name") > 21).show()
>>
>> I tested and both the $"column" and df(column) syntaxes work, but I'm
>> wondering which is *preferred*.  Is one the original and one a new
>> feature we should be using?
>>
>> Thanks,
>> Diana
>> (Spark Curriculum Developer for Cloudera)
>>
>
>

-- 
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com


Re: Spark vs Google cloud dataflow

2014-06-27 Thread Dean Wampler
... and to be clear on the point, Summingbird is not limited to MapReduce.
It abstracts over Scalding (which abstracts over Cascading, which is being
moved from MR to Spark) and over Storm for event processing.


On Fri, Jun 27, 2014 at 7:16 AM, Sean Owen  wrote:

> On Thu, Jun 26, 2014 at 9:15 AM, Aureliano Buendia 
> wrote:
> > Summingbird is for map/reduce. Dataflow is the third generation of Google's
> > map/reduce, and it generalizes map/reduce the way Spark does. See more about
> > this here: http://youtu.be/wtLJPvx7-ys?t=2h37m8s
>
> Yes, my point was that Summingbird is similar in that it is a
> higher-level service for batch/streaming computation, not that it is
> similar for being MapReduce-based.
>
> > It seems Dataflow is based on this paper:
> > http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf
>
> FlumeJava maps to Crunch in the Hadoop ecosystem. I think Dataflow is
> more than that, but yeah that seems to be some of the 'language'. It is
> similar in that it is a distributed collection abstraction.
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Recommended pipeline automation tool? Oozie?

2014-07-15 Thread Dean Wampler
If you're already using Scala for Spark programming and you hate Oozie XML
as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie:
https://github.com/klout/scoozie
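
For comparison, a bare-bones sketch of the roll-your-own step pipeline Andrei
describes below (purely illustrative; all names are made up, and finalize() is
renamed to avoid clashing with java.lang.Object#finalize):

trait Step {
  def run(context: collection.mutable.Map[String, Any]): Unit
  def fail(e: Throwable): Unit = throw e   // default: abort the pipeline
  def finalizeStep(): Unit = ()            // default: nothing to clean up
}

def runPipeline(steps: Seq[Step]): Unit = {
  val context = collection.mutable.Map.empty[String, Any]
  try {
    steps.foreach { step =>
      try step.run(context)
      catch { case e: Throwable => step.fail(e) }
    }
  } finally {
    // Runs once all steps are done (or the pipeline aborted).
    steps.foreach(_.finalizeStep())
  }
}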


On Thu, Jul 10, 2014 at 5:52 PM, Andrei  wrote:

> I used both - Oozie and Luigi - but found them inflexible and still
> overcomplicated, especially in presence of Spark.
>
> Oozie has a fixed list of building blocks, which is pretty limiting. For
> example, you can launch Hive query, but Impala, Shark/SparkSQL, etc. are
> out of scope (of course, you can always write wrapper as Java or Shell
> action, but does it really need to be so complicated?). Another issue with
> Oozie is passing variables between actions. There's Oozie context that is
> suitable for passing key-value pairs (both strings) between actions, but
> for more complex objects (say, FileInputStream that should be closed at
> last step only) you have to do some advanced kung fu.
>
> Luigi, on other hand, has its niche - complicated dataflows with many
> tasks that depend on each other. Basically, there are tasks (this is where
> you define computations) and targets (something that can "exist" - file on
> disk, entry in ZooKeeper, etc.). You ask Luigi to get some target, and it
> creates a plan for achieving this. Luigi is really shiny when your workflow
> fits this model, but one step away and you are in trouble. For example,
> consider simple pipeline: run MR job and output temporary data, run another
> MR job and output final data, clean temporary data. You can make target
> Clean, that depends on target MRJob2 that, in its turn, depends on MRJob1,
> right? Not so easy. How do you check that Clean task is achieved? If you
> just test whether temporary directory is empty or not, you catch both cases
> - when all tasks are done and when they are not even started yet. Luigi
> allows you to specify all 3 actions - MRJob1, MRJob2, Clean - in a single
> "run()" method, but ruins the entire idea.
>
> And of course, both of these frameworks are optimized for standard
> MapReduce jobs, which is probably not what you want on Spark mailing list
> :)
>
> Experience with these frameworks, however, gave me some insights about
> typical data pipelines.
>
> 1. Pipelines are mostly linear. Oozie, Luigi and number of other
> frameworks allow branching, but most pipelines actually consist of moving
> data from source to destination with possibly some transformations in
> between (I'll be glad if somebody share use cases when you really need
> branching).
> 2. Transactional logic is important. Either everything, or nothing.
> Otherwise it's really easy to get into inconsistent state.
> 3. Extensibility is important. You never know what you will need in a week or
> two.
>
> So eventually I decided that it is much easier to create your own pipeline
> instead of trying to adopt your code to existing frameworks. My latest
> pipeline incarnation simply consists of a list of steps that are started
> sequentially. Each step is a class with at least these methods:
>
>  * run() - launch this step
>  * fail() - what to do if step fails
>  * finalize() - (optional) what to do when all steps are done
>
> For example, if you want to add possibility to run Spark jobs, you just
> create SparkStep and configure it with required code. If you want Hive
> query - just create HiveStep and configure it with Hive connection
> settings. I use YAML file to configure steps and Context (basically,
> Map[String, Any]) to pass variables between them. I also use configurable
> Reporter available for all steps to report the progress.
>
> Hopefully, this will give you some insights about best pipeline for your
> specific case.
>
>
>
> On Thu, Jul 10, 2014 at 9:10 PM, Paul Brown  wrote:
>
>>
>> We use Luigi for this purpose.  (Our pipelines are typically on AWS (no
>> EMR) backed by S3 and using combinations of Python jobs, non-Spark
>> Java/Scala, and Spark.  We run Spark jobs by connecting drivers/clients to
>> the master, and those are what is invoked from Luigi.)
>>
>> —
>> p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
>>
>>
>> On Thu, Jul 10, 2014 at 10:20 AM, k.tham  wrote:
>>
>>> I'm just wondering what's the general recommendation for data pipeline
>>> automation.
>>>
>>> Say, I want to run Spark Job A, then B, then invoke script C, then do D,
>>> and
>>> if D fails, do E, and if Job A fails, send email F, etc...
>>>
>>> It looks like Oozie might be the best choice. But I'd like some
>>> advice/suggestions.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Recommended-pipeline-automation-tool-Oozie-tp9319.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>
>>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Dean Wampler
The stack trace suggests it was trying to create a temporary file, not read
your file. Of course, it doesn't say what file it couldn't create.

Could there be a configuration file, like a Hadoop config file, that was
read with a temp dir setting that's invalid for your machine?
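
A quick diagnostic sketch (assuming an existing SparkContext sc) to see which
temp locations the process thinks it should use:

// Where does the JVM create temp files by default?
println("java.io.tmpdir  = " + System.getProperty("java.io.tmpdir"))
// Which scratch directory is Spark configured to use? Defaults to /tmp.
println("spark.local.dir = " + sc.getConf.get("spark.local.dir", "/tmp"))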

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com


On Thu, Jul 31, 2014 at 4:04 PM, Ryan Tabora  wrote:

> Hey all,
>
> I was able to spawn up a cluster, but when I'm trying to submit a simple
> jar via spark-submit it fails to run. I am trying to run the simple
> "Standalone Application" from the quickstart.
>
> Oddly enough, I could get another application running through the
> spark-shell. What am I doing wrong here? :(
>
> http://spark.apache.org/docs/latest/quick-start.html
>
> * Here's my setup: *
>
> $ ls
> project  simple.sbt  src  target
>
> $ ls -R src
> src:
> main
>
> src/main:
> scala
>
> src/main/scala:
> SimpleApp.scala
>
> $ cat src/main/scala/SimpleApp.scala
> package main.scala
>
> /* SimpleApp.scala */
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
>
> object SimpleApp {
> def main(args: Array[String]) {
>   val logFile = "/tmp/README.md"
>   val conf = new SparkConf().setAppName("Simple Application")
>   val sc = new SparkContext(conf)
>   val logData = sc.textFile(logFile, 2).cache()
>   val numAs = logData.filter(line =>
> line.contains("a")).count()
>   val numBs = logData.filter(line =>
> line.contains("b")).count()
>   println("Lines with a: %s, Lines with b:
> %s".format(numAs, numBs))
> }
> }
>
> $ cat simple.sbt
> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion := "2.10.4"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1"
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>
> * Here's how I run the job: *
>
> $ /root/spark/bin/spark-submit --class "main.scala.SimpleApp" --master
> local[4] ./target/scala-2.10/simple-project_2.10-1.0.jar
>
> *Here is the error: *
>
> 14/07/31 16:23:56 INFO scheduler.DAGScheduler: Failed to run count at
> SimpleApp.scala:14
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0.0:1 failed 1 times, most recent failure:
> Exception failure in TID 1 on host localhost: java.io.IOException: No such
> file or directory
> java.io.UnixFileSystem.createFileExclusively(Native Method)
> java.io.File.createNewFile(File.java:1006)
> java.io.File.createTempFile(File.java:1989)
> org.apache.spark.util.Utils$.fetchFile(Utils.scala:326)
>
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:332)
>
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
> scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> org.apache.spark.executor.Executor.org
> $apache$spark$executor$Executor$$updateDependencies(Executor.scala:330)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:168)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
> 

Re: Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Dean Wampler
Forgot to add that I tried your program with the same input file path. It
worked fine. (I used local[2], however...)

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com


On Thu, Jul 31, 2014 at 5:07 PM, Dean Wampler  wrote:

> The stack trace suggests it was trying to create a temporary file, not
> read your file. Of course, it doesn't say what file it couldn't create.
>
> Could there be a configuration file, like a Hadoop config file, that was
> read with a temp dir setting that's invalid for your machine?
>
> dean
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
>
> On Thu, Jul 31, 2014 at 4:04 PM, Ryan Tabora  wrote:
>
>> Hey all,
>>
>> I was able to spawn up a cluster, but when I'm trying to submit a simple
>> jar via spark-submit it fails to run. I am trying to run the simple
>> "Standalone Application" from the quickstart.
>>
>> Oddly enough, I could get another application running through the
>> spark-shell. What am I doing wrong here? :(
>>
>> http://spark.apache.org/docs/latest/quick-start.html
>>
>> * Here's my setup: *
>>
>> $ ls
>> project  simple.sbt  src  target
>>
>> $ ls -R src
>> src:
>> main
>>
>> src/main:
>> scala
>>
>> src/main/scala:
>> SimpleApp.scala
>>
>> $ cat src/main/scala/SimpleApp.scala
>> package main.scala
>>
>> /* SimpleApp.scala */
>> import org.apache.spark.SparkContext
>> import org.apache.spark.SparkContext._
>> import org.apache.spark.SparkConf
>>
>> object SimpleApp {
>> def main(args: Array[String]) {
>>   val logFile = "/tmp/README.md"
>>   val conf = new SparkConf().setAppName("Simple Application")
>>   val sc = new SparkContext(conf)
>>   val logData = sc.textFile(logFile, 2).cache()
>>   val numAs = logData.filter(line =>
>> line.contains("a")).count()
>>   val numBs = logData.filter(line =>
>> line.contains("b")).count()
>>   println("Lines with a: %s, Lines with
>> b: %s".format(numAs, numBs))
>> }
>> }
>>
>> $ cat simple.sbt
>> name := "Simple Project"
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.4"
>>
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.1"
>>
>> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>>
>> * Here's how I run the job: *
>>
>> $ /root/spark/bin/spark-submit --class "main.scala.SimpleApp" --master
>> local[4] ./target/scala-2.10/simple-project_2.10-1.0.jar
>>
>> *Here is the error: *
>>
>> 14/07/31 16:23:56 INFO scheduler.DAGScheduler: Failed to run count at
>> SimpleApp.scala:14
>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>> due to stage failure: Task 0.0:1 failed 1 times, most recent failure:
>> Exception failure in TID 1 on host localhost: java.io.IOException: No such
>> file or directory
>> java.io.UnixFileSystem.createFileExclusively(Native Method)
>> java.io.File.createNewFile(File.java:1006)
>> java.io.File.createTempFile(File.java:1989)
>> org.apache.spark.util.Utils$.fetchFile(Utils.scala:326)
>>
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:332)
>>
>> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$6.apply(Executor.scala:330)
>>
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>
>> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>>
>> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>> scala.

Re: Issue with Spark on EC2 using spark-ec2 script

2014-08-01 Thread Dean Wampler
It looked like you were running in local mode (master set to
local[4]). That's how I ran it.

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com


On Thu, Jul 31, 2014 at 8:37 PM, ratabora  wrote:

> Hey Dean! Thanks!
>
> Did you try running this on a local environment or one generated by the
> spark-ec2 script?
>
> The environment I am running on is a 4-data-node, 1-master Spark cluster
> generated by the spark-ec2 script. I haven't modified anything in the
> environment except for adding data to the ephemeral HDFS.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Issue-with-Spark-on-EC2-using-spark-ec2-script-tp11088p7.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Can you post your whole SBT build file(s)?

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler  wrote:

> Hi,
>
> I just called:
>
> > test
>
> or
>
> > run
>
> Thorsten
>
>
> Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com:
>
>  Hi,
>>
>> What is your SBT command and the parameters?
>>
>> Arthur
>>
>>
>> On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler  wrote:
>>
>>  Hello,
>>>
>>> I am writing a Spark App which is already working so far.
>>> Now I have also started to build some unit tests, but I am running into some
>>> dependency problems and I cannot find a solution right now. Perhaps someone
>>> could help me.
>>>
>>> I build my Spark Project with SBT and it seems to be configured well,
>>> because compiling, assembling and running the built jar with spark-submit
>>> are working well.
>>>
>>> Now I started with the UnitTests, which I located under /src/test/scala.
>>>
>>> When I call "test" in sbt, I get the following:
>>>
>>> 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered
>>> BlockManager
>>> 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
>>> [trace] Stack trace suppressed: run last test:test for the full output.
>>> [error] Could not run test test.scala.SetSuite: 
>>> java.lang.NoClassDefFoundError:
>>> javax/servlet/http/HttpServletResponse
>>> [info] Run completed in 626 milliseconds.
>>> [info] Total number of tests run: 0
>>> [info] Suites: completed 0, aborted 0
>>> [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
>>> [info] All tests passed.
>>> [error] Error during tests:
>>> [error] test.scala.SetSuite
>>> [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
>>> [error] Total time: 3 s, completed 10.09.2014 12:22:06
>>>
>>> last test:test gives me the following:
>>>
>>>  last test:test
>>>>
>>> [debug] Running TaskDef(test.scala.SetSuite,
>>> org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer(
>>> HttpBroadcast.scala:156)
>>> at org.apache.spark.broadcast.HttpBroadcast$.initialize(
>>> HttpBroadcast.scala:127)
>>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
>>> HttpBroadcastFactory.scala:31)
>>> at org.apache.spark.broadcast.BroadcastManager.initialize(
>>> BroadcastManager.scala:48)
>>> at org.apache.spark.broadcast.BroadcastManager.(
>>> BroadcastManager.scala:35)
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>> at test.scala.SetSuite.(SparkTest.scala:16)
>>>
>>> I also noticed right now, that sbt run is also not working:
>>>
>>> 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
>>> [error] (run-main-2) java.lang.NoClassDefFoundError: javax/servlet/http/
>>> HttpServletResponse
>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer(
>>> HttpBroadcast.scala:156)
>>> at org.apache.spark.broadcast.HttpBroadcast$.initialize(
>>> HttpBroadcast.scala:127)
>>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
>>> HttpBroadcastFactory.scala:31)
>>> at org.apache.spark.broadcast.BroadcastManager.initialize(
>>> BroadcastManager.scala:48)
>>> at org.apache.spark.broadcast.BroadcastManager.(
>>> BroadcastManager.scala:35)
>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>> at main.scala.PartialDuplicateScanner$.main(
>>> PartialDuplicateScanner.scala:29)
>>> at main.scala.PartialDuplicateScanner.main(
>>> PartialDuplicateScanner.scala)
>>>

Re: Dependency Problem with Spark / ScalaTest / SBT

2014-09-14 Thread Dean Wampler
Sorry, I meant any *other* SBT files.

However, what happens if you remove the line:

exclude("org.eclipse.jetty.orbit", "javax.servlet")
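
For context, that line typically appears in a dependency declaration along
these lines (a sketch only; your actual build.sbt wasn't preserved in this
thread, and the version is a placeholder):

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.2" exclude("org.eclipse.jetty.orbit", "javax.servlet")

Removing the exclude lets the transitive servlet-api artifact back onto the
classpath, which is presumably where HttpServletResponse should come from.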


dean


Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sun, Sep 14, 2014 at 11:53 AM, Dean Wampler 
wrote:

> Can you post your whole SBT build file(s)?
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Wed, Sep 10, 2014 at 6:48 AM, Thorsten Bergler 
> wrote:
>
>> Hi,
>>
>> I just called:
>>
>> > test
>>
>> or
>>
>> > run
>>
>> Thorsten
>>
>>
>> Am 10.09.2014 um 13:38 schrieb arthur.hk.c...@gmail.com:
>>
>>  Hi,
>>>
>>> What is your SBT command and the parameters?
>>>
>>> Arthur
>>>
>>>
>>> On 10 Sep, 2014, at 6:46 pm, Thorsten Bergler  wrote:
>>>
>>>  Hello,
>>>>
>>>> I am writing a Spark App which is already working so far.
>>>> Now I have also started to build some unit tests, but I am running into some
>>>> dependency problems and I cannot find a solution right now. Perhaps someone
>>>> could help me.
>>>>
>>>> I build my Spark Project with SBT and it seems to be configured well,
>>>> because compiling, assembling and running the built jar with spark-submit
>>>> are working well.
>>>>
>>>> Now I started with the UnitTests, which I located under /src/test/scala.
>>>>
>>>> When I call "test" in sbt, I get the following:
>>>>
>>>> 14/09/10 12:22:06 INFO storage.BlockManagerMaster: Registered
>>>> BlockManager
>>>> 14/09/10 12:22:06 INFO spark.HttpServer: Starting HTTP Server
>>>> [trace] Stack trace suppressed: run last test:test for the full output.
>>>> [error] Could not run test test.scala.SetSuite: 
>>>> java.lang.NoClassDefFoundError:
>>>> javax/servlet/http/HttpServletResponse
>>>> [info] Run completed in 626 milliseconds.
>>>> [info] Total number of tests run: 0
>>>> [info] Suites: completed 0, aborted 0
>>>> [info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
>>>> [info] All tests passed.
>>>> [error] Error during tests:
>>>> [error] test.scala.SetSuite
>>>> [error] (test:test) sbt.TestsFailedException: Tests unsuccessful
>>>> [error] Total time: 3 s, completed 10.09.2014 12:22:06
>>>>
>>>> last test:test gives me the following:
>>>>
>>>>  last test:test
>>>>>
>>>> [debug] Running TaskDef(test.scala.SetSuite,
>>>> org.scalatest.tools.Framework$$anon$1@6e5626c8, false, [SuiteSelector])
>>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer(
>>>> HttpBroadcast.scala:156)
>>>> at org.apache.spark.broadcast.HttpBroadcast$.initialize(
>>>> HttpBroadcast.scala:127)
>>>> at org.apache.spark.broadcast.HttpBroadcastFactory.initialize(
>>>> HttpBroadcastFactory.scala:31)
>>>> at org.apache.spark.broadcast.BroadcastManager.initialize(
>>>> BroadcastManager.scala:48)
>>>> at org.apache.spark.broadcast.BroadcastManager.(
>>>> BroadcastManager.scala:35)
>>>> at org.apache.spark.SparkEnv$.create(SparkEnv.scala:218)
>>>> at org.apache.spark.SparkContext.(SparkContext.scala:202)
>>>> at test.scala.SetSuite.(SparkTest.scala:16)
>>>>
>>>> I also noticed right now, that sbt run is also not working:
>>>>
>>>> 14/09/10 12:44:46 INFO spark.HttpServer: Starting HTTP Server
>>>> [error] (run-main-2) java.lang.NoClassDefFoundError:
>>>> javax/servlet/http/HttpServletResponse
>>>> java.lang.NoClassDefFoundError: javax/servlet/http/HttpServletResponse
>>>> at org.apache.spark.HttpServer.start(HttpServer.scala:54)
>>>> at org.apache.spark.broadcast.HttpBroadcast$.createServer(

Re: SparkSQL Thriftserver in Mesos

2014-09-22 Thread Dean Wampler
The Mesos install guide says this:

"To use Mesos from Spark, you need a Spark binary package available in a
place accessible by Mesos, and a Spark driver program configured to connect
to Mesos."

For example, putting it in HDFS or copying it to each node in the same
location should do the trick.

https://spark.apache.org/docs/latest/running-on-mesos.html
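
For example, a couple of lines in conf/spark-defaults.conf along these lines
(the master URL and HDFS path are placeholders) let every Mesos slave fetch
the same Spark package:

spark.master        mesos://host:5050
spark.executor.uri  hdfs:///frameworks/spark/spark-1.1.0-bin-hadoop2.4.tgz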



Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Mon, Sep 22, 2014 at 2:35 PM, John Omernik  wrote:

> Any thoughts on this?
>
> On Sat, Sep 20, 2014 at 12:16 PM, John Omernik  wrote:
>
>> I am running the Thrift server in Spark SQL, and running it on the node I
>> compiled Spark on.  When I run it, tasks only work if they land on that
>> node; other executors started on nodes I didn't compile Spark on (and thus
>> don't have the compile directory) fail.  Should Spark be distributed
>> properly with the executor URI in my spark-defaults for Mesos?
>>
>> Here is the error on nodes with Lost executors
>>
>> sh: 1: /opt/mapr/spark/spark-1.1.0-SNAPSHOT/sbin/spark-executor: not found
>>
>>
>>
>


Re: scala Vector vs mllib Vector

2014-10-04 Thread Dean Wampler
Briefly, MLlib's Vector and the concrete subclasses DenseVector and
SparseVector wrap Java arrays, which are mutable and maximize memory
efficiency. To update one of these vectors, you mutate the elements of the
underlying array. That's great for performance, but dangerous in
multithreaded programs for all the usual reasons. Scala's Vector is a
*persistent
data structure* (best to google that term...), with O(1) operations, but a
higher constant factor. Scala Vector instances are immutable, so mutating
operations return a new Vector, but the "persistent" implementation uses
structure sharing (the unchanged parts) to make efficient copies.

Also, Scala Vector isn't designed to represent sparse vectors.
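
A tiny sketch of the difference (Spark 1.x MLlib API; the numbers are
arbitrary):

import org.apache.spark.mllib.linalg.{DenseVector, Vectors}

// MLlib's DenseVector wraps a plain Java array: updates mutate it in place.
val arr = Array(1.0, 2.0, 3.0)
val mllibVec = new DenseVector(arr)
arr(0) = 10.0                               // mllibVec now sees 10.0 at index 0

// Scala's Vector is immutable and persistent: "updates" return a new Vector
// that shares most of its structure with the old one.
val scalaVec = Vector(1.0, 2.0, 3.0)
val updatedVec = scalaVec.updated(0, 10.0)  // scalaVec itself is unchanged

// MLlib also has a sparse representation; the Scala collection does not.
val sparseVec = Vectors.sparse(1000, Array(3, 400), Array(1.5, 2.5))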

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sat, Oct 4, 2014 at 1:44 AM, ll  wrote:

> What are the pros/cons of each? When should we use the MLlib Vector, and when
> should we use the standard Scala Vector? Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/scala-Vector-vs-mllib-Vector-tp15736.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: scala Vector vs mllib Vector

2014-10-04 Thread Dean Wampler
Spark isolates each task, so I would use the MLlib vector. I didn't mention
this, but it also integrates with Breeze, a Scala mathematics library that
you might find useful.
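
For the weight-vector use case in the quoted question below, a rough sketch of
an in-place update loop using Breeze (computeGradient, numFeatures,
learningRate, and numIterations are placeholders, not part of any library):

import breeze.linalg.{DenseVector => BDV}

val w = BDV.zeros[Double](numFeatures)        // mutable weight vector, updated every iteration
for (_ <- 1 to numIterations) {
  val grad: BDV[Double] = computeGradient(w)  // assumed user-supplied gradient function
  w :-= grad * learningRate                   // update w in place (the scaled gradient is a temporary)
}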

dean

Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com

On Sat, Oct 4, 2014 at 8:52 AM, ll  wrote:

> thanks dean.  thanks for the answer with great clarity!
>
> i'm working on an algorithm that has a weight vector W(w0, w1, .., wN).
> the
> elements of this weight vector are adjusted/updated frequently - every
> iteration of the algorithm.  how would you recommend to implement this
> vector?  what is the best practice to implement this in Scala & Spark?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/scala-Vector-vs-mllib-Vector-tp15736p15741.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark Project Fails to run multicore in local mode.

2015-01-08 Thread Dean Wampler
Iterator$class.foreach(Iterator.scala:727) at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765) at
> org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:765) at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at
> org.apache.spark.scheduler.Task.run(Task.scala:56) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> --
> View this message in context: Spark Project Fails to run multicore in
> local mode.
> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Project-Fails-to-run-multicore-in-local-mode-tp21034.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


-- 
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com


Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
ting more and more
> cached. This is all fine and good, but in the meanwhile I see stages
> running on the UI that point to code which is used to load A and B. How is
> this possible? Am I misunderstanding how cached RDDs should behave?
>
> And again the general question - how can one debug such issues?
>
> 4. Shuffle on disk
> Is it true - I couldn't find it in official docs, but did see this
> mentioned in various threads - that shuffle _always_ hits disk?
> (Disregarding OS caches.) Why is this the case? Are you planning to add a
> function to do shuffle in memory or are there some intrinsic reasons for
> this to be impossible?
>
>
> Sorry again for the giant mail, and thanks for any insights!
>
> Andras
>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Spark - ready for prime time?

2014-04-10 Thread Dean Wampler
Here are several good ones:

https://www.google.com/search?q=cloudera+spark&oq=cloudera+spark&aqs=chrome..69i57j69i65l3j69i60l2.4439j0j7&sourceid=chrome&espv=2&es_sm=119&ie=UTF-8



On Thu, Apr 10, 2014 at 10:42 AM, Ian Ferreira wrote:

>  Do you have the link to the Cloudera comment?
>
> Sent from Windows Mail
>
> *From:* Dean Wampler 
> *Sent:* Thursday, April 10, 2014 7:39 AM
> *To:* Spark Users 
> *Cc:* Daniel Darabos , Andras 
> Barjak
>
> Spark has been endorsed by Cloudera as the successor to MapReduce. That
> says a lot...
>
>
> On Thu, Apr 10, 2014 at 10:11 AM, Andras Nemeth <
> andras.nem...@lynxanalytics.com> wrote:
>
>> Hello Spark Users,
>>
>> With the recent graduation of Spark to a top level project (grats, btw!),
>> maybe a well timed question. :)
>>
>> We are at the very beginning of a large scale big data project and after
>> two months of exploration work we'd like to settle on the technologies to
>> use, roll up our sleeves and start to build the system.
>>
>> Spark is one of the forerunners for our technology choice.
>>
>> My question in essence is whether it's a good idea or is Spark too
>> 'experimental' just yet to bet our lives (well, the project's life) on it.
>>
>> The benefits of choosing Spark are numerous and I guess all too obvious
>> for this audience - e.g. we love its powerful abstraction, ease of
>> development and the potential for using a single system for serving and
>> manipulating huge amount of data.
>>
>> This email aims to ask about the risks. I enlist concrete issues we've
>> encountered below, but basically my concern boils down to two philosophical
>> points:
>> I. Is it too much magic? Lots of things "just work right" in Spark and
>> it's extremely convenient and efficient when it indeed works. But should we
>> be worried that customization is hard if the built in behavior is not quite
>> right for us? Are we to expect hard to track down issues originating from
>> the black box behind the magic?
>> II. Is it mature enough? E.g. we've created a pull request
>> <https://github.com/apache/spark/pull/181> which fixes a problem that we
>> were very surprised no one ever stumbled upon before. So that's why I'm
>> asking: is Spark already being used in
>> professional settings? Can one already trust it being reasonably bug free
>> and reliable?
>>
>> I know I'm asking a biased audience, but that's fine, as I want to be
>> convinced. :)
>>
>> So, to the concrete issues. Sorry for the long mail, and let me know if I
>> should break this out into more threads or if there is some other way to
>> have this discussion...
>>
>> 1. Memory management
>> The general direction of these questions is whether it's possible to take
>> RDD caching related memory management more into our own hands as LRU
>> eviction is nice most of the time but can be very suboptimal in some of our
>> use cases.
>> A. Somehow prioritize cached RDDs, E.g. mark some "essential" that one
>> really wants to keep. I'm fine with going down in flames if I mark too much
>> data essential.
>> B. Memory "reflection": can you pragmatically get the memory size of a
>> cached rdd and memory sizes available in total/per executor? If we could do
>> this we could indirectly avoid automatic evictions of things we might
>> really want to keep in memory.
>> C. Evictions caused by RDD partitions on the driver. I had a setup with
>> huge worker memory and smallish memory on the driver JVM. To my surprise,
>> the system started to cache RDD partitions on the driver as well. As the
>> driver ran out of memory I started to see evictions while there was still
>> plenty of space on the workers. This resulted in lengthy recomputations. Can
>> this be avoided somehow?
>> D. Broadcasts. Is it possible to get rid of a broadcast manually, without
>> waiting for the LRU eviction taking care of it? Can you tell the size of a
>> broadcast programmatically?
>>
>>
>> 2. Akka lost connections
>> We have quite often experienced lost executors due to akka exceptions -
>> mostly connection lost or similar. It seems to happen when an executor gets
>> extremely busy with some CPU intensive work. Our hypothesis is that akka
>> network threads get starved and the executor fails to respond within
>> timeout limits. Is this plausible? If yes, what can we do with it?
>>
>> In general, these are scary errors in the sense that they come from t

Re: Using Spark for Divide-and-Conquer Algorithms

2014-04-11 Thread Dean Wampler
There is a handy parallelize method for running independent computations.
The examples page (http://spark.apache.org/examples.html) on the website
uses it to estimate Pi. You can join the results at the end of the parallel
calculations.
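
Roughly what the Pi example on that page does, assuming an existing
SparkContext called sc:

val n = 100000
val count = sc.parallelize(1 to n).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0              // independent work per element
}.reduce(_ + _)                                // results joined at the end
println("Pi is roughly " + 4.0 * count / n)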


On Fri, Apr 11, 2014 at 7:52 AM, Yanzhe Chen  wrote:

>  Hi all,
>
> Is Spark suitable for applications like Convex Hull algorithm, which has
> some classic divide-and-conquer approaches like QuickHull?
>
> More generally, Is there a way to express divide-and-conquer algorithms in
> Spark?
>
> Thanks!
>
> --
> Yanzhe Chen
> Institute of Parallel and Distributed Systems
> Shanghai Jiao Tong University
> Email: yanzhe...@gmail.com
> Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Hybrid GPU CPU computation

2014-04-11 Thread Dean Wampler
I've thought about this idea, although I haven't tried it. I think the
right approach is to pick your granularity boundary and use Spark + the JVM for
the large-scale parts of the algorithm, then use a GPGPU API for number
crunching large chunks at a time. There's no need to run the JVM and Spark on
the GPU itself, which would make no sense anyway.
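
A hypothetical sketch of that boundary (bigRDD and NativeGpuKernel are
placeholders standing in for your data and for whatever JNI/JCuda-style
binding you would actually call; neither is a real library):

val processed = bigRDD.mapPartitions { iter =>
  val chunk = iter.toArray                  // batch up one partition's data on the JVM side
  NativeGpuKernel.process(chunk).iterator   // assumed native call that crunches the chunk on the GPU
}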

Here's another approach:
http://www.cakesolutions.net/teamblogs/2013/02/13/akka-and-cuda/

dean


On Fri, Apr 11, 2014 at 7:49 AM, Saurabh Jha wrote:

> There is a Scala implementation for GPGPUs (NVIDIA CUDA, to be precise), but
> you also need to port Mesos for GPUs. I am not sure about Mesos. Also, the
> current Scala GPU version is not stable enough to be used commercially.
>
> Hope this helps.
>
> Thanks
> saurabh.
>
>
>
> *Saurabh Jha*
> Intl. Exchange Student
> School of Computing Engineering
> Nanyang Technological University,
> Singapore
> Web: http://profile.saurabhjha.in
> Mob: +65 94663172
>
>
> On Fri, Apr 11, 2014 at 8:40 PM, Pascal Voitot Dev <
> pascal.voitot@gmail.com> wrote:
>
>> This is a bit crazy :)
>> I suppose you would have to run Java code on the GPU!
>> I heard there are some funny projects to do that...
>>
>> Pascal
>>
>> On Fri, Apr 11, 2014 at 2:38 PM, Jaonary Rabarisoa wrote:
>>
>>> Hi all,
>>>
>>> I'm just wondering if hybrid GPU/CPU computation is something that is
>>> feasible with spark ? And what should be the best way to do it.
>>>
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>
>>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Do I need to learn Scala for spark ?

2014-04-21 Thread Dean Wampler
I'm doing a talk this week at the Philly ETE conference on Spark. I'll
compare the Hadoop Java API and the Spark Scala API for implementing the *inverted
index* algorithm. I'm going to make the case that the Spark API, like many
functional-programming APIs, is so powerful that it's well worth your time
to learn it. The Spark Java API certainly strives for this goal, too.
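
As a teaser, a compact version of that algorithm in the Scala API (assuming a
SparkContext sc and an input of docId<TAB>text lines; both are illustrative):

val docs = sc.textFile("docs.tsv").map { line =>
  val Array(id, text) = line.split("\t", 2)   // assumes docId<TAB>text per line
  (id, text)
}
val index = docs.flatMap { case (id, text) =>
    text.toLowerCase.split("""\W+""").filter(_.nonEmpty).map(word => (word, id))
  }
  .groupByKey()                               // word -> all docs containing it
index.saveAsTextFile("inverted-index")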

Another nice thing about the Scala API, which is also true of the APIs for
Scalding and Summingbird, is that you don't have to learn a lot of Scala to
use them (at least for a while...).

I'll post a link to the slides afterwards.

Dean Wampler
Typesafe, Inc.
@deanwampler


On Mon, Apr 21, 2014 at 8:46 AM, Pulasthi Supun Wickramasinghe <
pulasthi...@gmail.com> wrote:

> Hi,
>
> I think you can do just fine with your Java knowledge. There is a Java API
> that you can use [1]. I am also new to Spark and i have got around with
> just my Java knowledge. And Scala is easy to learn if you are good with
> Java.
>
> [1] http://spark.apache.org/docs/latest/java-programming-guide.html
>
> Regards,
> Pulasthi
>
>
> On Mon, Apr 21, 2014 at 7:10 PM, arpan57  wrote:
>
>> Hi guys,
>>  I read Spark is pretty faster than Hadoop and that inspires me to learn
>> it.
>>  I've hands on exp. with Hadoop (MR-1). And pretty good with java
>> programming.
>>  Do I need to learn Scala in order to learn Spark ?
>>  Can I go ahead and write my jobs in Java and run on spark ?
>>  How much dependency is there on Scala to learn spark ?
>>
>> Thanks in advance.
>>
>> Regards,
>> Arpan
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Do-I-need-to-learn-Scala-for-spark-tp4528.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
>
> --
> Pulasthi Supun
> Undergraduate
> Dpt of Computer Science & Engineering
> University of Moratuwa
> Blog : http://pulasthisupun.blogspot.com/
> GitHub profile: https://github.com/pulasthi
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: K-means with large K

2014-04-28 Thread Dean Wampler
You might also investigate other clustering algorithms, such as canopy
clustering and nearest neighbors. Some of them are less accurate, but more
computationally efficient. Often they are used to compute approximate
clusters followed by k-means (or a variant thereof) for greater accuracy.
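
If you do stick with MLlib's k-means, one knob worth experimenting with (my
suggestion, not something from this thread) is the cheaper random
initialization, since the default k-means|| seeding does extra passes whose
cost grows with k. A minimal sketch against the MLlib 1.x API, assuming data
is an existing RDD[Vector]:

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(2000)
  .setMaxIterations(20)
  .setInitializationMode("random")   // skip the costlier default k-means|| seeding
  .run(data)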

dean


On Mon, Apr 28, 2014 at 11:41 AM, Buttler, David  wrote:

>  One thing I have used this for was to create codebooks for SIFT features
> in images.  It is a common, though fairly naïve, method for converting high
> dimensional features into a simple word-like features.  Thus, if you have
> 200 SIFT features for an image, you can reduce that to 200 ‘words’ that can
> be directly compared across your entire image set.  The drawback is that
> usually some parts of the feature space are much more dense than other
> parts and distinctive features could be lost.  You can try to minimize that
> by increasing K, but there are diminishing returns.  If you measure the
> quality of your clusters in such situations, you will find that the quality
> levels off between 1000 and 4000 clusters (at least it did for my SIFT
> feature set, YMMV on other data sets).
>
>
>
> Dave
>
>
>
> *From:* Chester Chen [mailto:chesterxgc...@yahoo.com]
> *Sent:* Monday, April 28, 2014 9:31 AM
> *To:* user@spark.apache.org
> *Cc:* user@spark.apache.org
> *Subject:* Re: K-means with large K
>
>
>
> David,
>
>   Just curious to know what kind of use cases demand such large k clusters
>
>
>
> Chester
>
> Sent from my iPhone
>
>
> On Apr 28, 2014, at 9:19 AM, "Buttler, David"  wrote:
>
>  Hi,
>
> I am trying to run the K-means code in mllib, and it works very nicely
> with small K (less than 1000).  However, when I try for a larger K (I am
> looking for 2000-4000 clusters), it seems like the code gets part way
> through (perhaps just the initialization step) and freezes.  The compute
> nodes stop doing any CPU / network / IO and nothing happens for hours.  I
> had done something similar back in the days of Spark 0.6, and I didn’t have
> any trouble going up to 4000 clusters with similar data.
>
>
>
> This happens with both a standalone cluster, and in local multi-core mode
> (with the node given 200GB of heap), but eventually completes in local
> single-core mode.
>
>
>
> Data statistics:
>
> Rows: 166248
>
> Columns: 108
>
>
>
> This is a test run before trying it out on much larger data
>
>
>
> Any ideas on what might be the cause of this?
>
>
>
> Thanks,
>
> Dave
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


My talk on "Spark: The Next Top (Compute) Model"

2014-04-30 Thread Dean Wampler
I meant to post this last week, but this is a talk I gave at the Philly ETE
conf. last week:

http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model

Also here:

http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf

dean

-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Dean Wampler
Thanks for the clarification. I'll fix the slide. I've done a lot of
Scalding/Cascading programming where the two concepts are synonymous, but
clearly I was imposing my prejudices here ;)
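
For readers skimming the thread, the master URL is what actually selects the
mode; a minimal illustration (hostnames and thread counts are placeholders):

import org.apache.spark.SparkConf

val localConf      = new SparkConf().setMaster("local[4]")           // "local" mode: one JVM, 4 threads
val standaloneConf = new SparkConf().setMaster("spark://host:7077")  // Spark's standalone cluster manager
val mesosConf      = new SparkConf().setMaster("mesos://host:5050")  // Mesos-managed cluster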

dean


On Thu, May 1, 2014 at 8:18 AM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

> Cool intro, thanks! One question. On slide 23 it says "Standalone ("local"
> mode)". That sounds a bit confusing without hearing the talk.
>
> Standalone mode is not local. It just does not depend on a cluster
> software. I think it's the best mode for EC2/GCE, because they provide a
> distributed filesystem anyway (S3/GCS). Why configure Hadoop if you don't
> have to.
>
>
> On Thu, May 1, 2014 at 12:25 AM, Dean Wampler wrote:
>
>> I meant to post this last week, but this is a talk I gave at the Philly
>> ETE conf. last week:
>>
>> http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model
>>
>> Also here:
>>
>> http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf
>>
>> dean
>>
>> --
>> Dean Wampler, Ph.D.
>> Typesafe
>> @deanwampler
>> http://typesafe.com
>> http://polyglotprogramming.com
>>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Dean Wampler
That's great! Thanks. Let me know if it works ;) or what I could improve to
make it work.

dean


On Thu, May 1, 2014 at 8:45 AM, ZhangYi  wrote:

>  Very useful material. Currently, I am trying to persuade my client to choose
> Spark instead of Hadoop MapReduce. Your slides give me more evidence to
> support my opinion.
>
> --
> ZhangYi (张逸)
> Developer
> tel: 15023157626
> blog: agiledon.github.com
> weibo: tw张逸
> Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
>
> On Thursday, May 1, 2014 at 9:18 PM, Daniel Darabos wrote:
>
> Cool intro, thanks! One question. On slide 23 it says "Standalone ("local"
> mode)". That sounds a bit confusing without hearing the talk.
>
> Standalone mode is not local. It just does not depend on a cluster
> software. I think it's the best mode for EC2/GCE, because they provide a
> distributed filesystem anyway (S3/GCS). Why configure Hadoop if you don't
> have to.
>
>
> On Thu, May 1, 2014 at 12:25 AM, Dean Wampler wrote:
>
>  I meant to post this last week, but this is a talk I gave at the Philly
> ETE conf. last week:
>
> http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model
>
> Also here:
>
> http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf
>
> dean
>
> --
> Dean Wampler, Ph.D.
> Typesafe
> @deanwampler
> http://typesafe.com
> http://polyglotprogramming.com
>
>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: My talk on "Spark: The Next Top (Compute) Model"

2014-05-01 Thread Dean Wampler
I updated the uploads at both locations to fix slide 23. Thanks for the
feedback.

dean


On Thu, May 1, 2014 at 9:25 AM, diplomatic Guru wrote:

> Thanks Dean, very useful indeed!
>
> Best regards,
>
> Raj
>
>
> On 1 May 2014 14:46, Dean Wampler  wrote:
>
>> That's great! Thanks. Let me know if it works ;) or what I could improve
>> to make it work.
>>
>> dean
>>
>>
>> On Thu, May 1, 2014 at 8:45 AM, ZhangYi  wrote:
>>
>>>  Very useful material. Currently, I am trying to persuade my client to
>>> choose Spark instead of Hadoop MapReduce. Your slides give me more evidence
>>> to support my opinion.
>>>
>>> --
>>> ZhangYi (张逸)
>>> Developer
>>> tel: 15023157626
>>> blog: agiledon.github.com
>>> weibo: tw张逸
>>> Sent with Sparrow <http://www.sparrowmailapp.com/?sig>
>>>
>>> On Thursday, May 1, 2014 at 9:18 PM, Daniel Darabos wrote:
>>>
>>> Cool intro, thanks! One question. On slide 23 it says "Standalone
>>> ("local" mode)". That sounds a bit confusing without hearing the talk.
>>>
>>> Standalone mode is not local. It just does not depend on a cluster
>>> software. I think it's the best mode for EC2/GCE, because they provide a
>>> distributed filesystem anyway (S3/GCS). Why configure Hadoop if you don't
>>> have to.
>>>
>>>
>>> On Thu, May 1, 2014 at 12:25 AM, Dean Wampler wrote:
>>>
>>>  I meant to post this last week, but this is a talk I gave at the Philly
>>> ETE conf. last week:
>>>
>>> http://www.slideshare.net/deanwampler/spark-the-next-top-compute-model
>>>
>>> Also here:
>>>
>>> http://polyglotprogramming.com/papers/Spark-TheNextTopComputeModel.pdf
>>>
>>> dean
>>>
>>> --
>>> Dean Wampler, Ph.D.
>>> Typesafe
>>> @deanwampler
>>> http://typesafe.com
>>> http://polyglotprogramming.com
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Dean Wampler, Ph.D.
>> Typesafe
>> @deanwampler
>> http://typesafe.com
>> http://polyglotprogramming.com
>>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Spark Training

2014-05-01 Thread Dean Wampler
I'm working on a 1-day workshop that I'm giving in Australia next week and
at a few other conferences later in the year. I'll post a link when it's ready.

dean


On Thu, May 1, 2014 at 10:30 AM, Denny Lee  wrote:

> You may also want to check out Paco Nathan's Introduction to Spark
> courses: http://liber118.com/pxn/
>
>
>
> On May 1, 2014, at 8:20 AM, Mayur Rustagi  wrote:
>
> Hi Nicholas,
> We provide hands-on training on Spark and the associated ecosystem.
> We gave it recently at a conference in Santa Clara. Primarily it's
> targeted at novices in the Spark ecosystem, to introduce them hands-on and
> get them to write simple code & also queries on Shark.
> I think Cloudera also has one which is for Spark, Streaming & MLLib.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi <https://twitter.com/mayur_rustagi>
>
>
>
> On Thu, May 1, 2014 at 8:42 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> There are many freely-available resources for the enterprising individual
>> to use if they want to Spark up their life.
>>
>> For others, some structured training is in order. Say I want everyone
>> from my department at my company to get something like the AMP Camp
>> <http://ampcamp.berkeley.edu/> experience, perhaps on-site.
>>
>> What are my options for that?
>>
>> Databricks doesn't have a contact page, so I figured this would be the
>> next best place to ask.
>>
>> Nick
>>
>>
>> --
>> View this message in context: Spark Training
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Training-tp5166.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Spark and Java 8

2014-05-06 Thread Dean Wampler
Cloudera customers will need to put pressure on them to support Java 8.
They only officially supported Java 7 when Oracle stopped supporting Java 6.

dean


On Wed, May 7, 2014 at 5:05 AM, Matei Zaharia wrote:

> Java 8 support is a feature in Spark, but vendors need to decide for
> themselves when they’d like to support Java 8 commercially. You can still run
> Spark on Java 7 or 6 without taking advantage of the new features (indeed
> our builds are always against Java 6).
>
> Matei
>
> On May 6, 2014, at 8:59 AM, Ian O'Connell  wrote:
>
> I think the distinction there might be they never said they ran that code
> under CDH5, just that spark supports it and spark runs under CDH5. Not that
> you can use these features while running under CDH5.
>
> They could use mesos or the standalone scheduler to run them
>
>
> On Tue, May 6, 2014 at 6:16 AM, Kristoffer Sjögren wrote:
>
>> Hi
>>
>> I just read an article [1] about Spark, CDH5 and Java 8 but did not get
>> exactly how Spark can run Java 8 on a YARN cluster at runtime. Is Spark
>> using a separate JVM that run on data nodes or is it reusing the YARN JVM
>> runtime somehow, like hadoop1?
>>
>> CDH5 only supports Java 7 [2] as far as I know?
>>
>> Cheers,
>> -Kristoffer
>>
>>
>> [1]
>> http://blog.cloudera.com/blog/2014/04/making-apache-spark-easier-to-use-in-java-with-java-8/
>> [2]
>> http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Requirements-and-Supported-Versions/CDH5-Requirements-and-Supported-Versions.html
>>
>>
>>
>>
>>
>
>


-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com


Re: Announcing Spark 1.0.0

2014-05-30 Thread Dean Wampler
Congratulations!!


On Fri, May 30, 2014 at 5:12 AM, Patrick Wendell  wrote:

> I'm thrilled to announce the availability of Spark 1.0.0! Spark 1.0.0
> is a milestone release as the first in the 1.0 line of releases,
> providing API stability for Spark's core interfaces.
>
> Spark 1.0.0 is Spark's largest release ever, with contributions from
> 117 developers. I'd like to thank everyone involved in this release -
> it was truly a community effort with fixes, features, and
> optimizations contributed from dozens of organizations.
>
> This release expands Spark's standard libraries, introducing a new SQL
> package (SparkSQL) which lets users integrate SQL queries into
> existing Spark workflows. MLlib, Spark's machine learning library, is
> expanded with sparse vector support and several new algorithms. The
> GraphX and Streaming libraries also introduce new features and
> optimizations. Spark's core engine adds support for secured YARN
> clusters, a unified tool for submitting Spark applications, and
> several performance and stability improvements. Finally, Spark adds
> support for Java 8 lambda syntax and improves coverage of the Java and
> Python APIs.
>
> Those features only scratch the surface - check out the release notes here:
> http://spark.apache.org/releases/spark-release-1-0-0.html
>
> Note that since release artifacts were posted recently, certain
> mirrors may not have working downloads for a few hours.
>
> - Patrick
>



-- 
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com

