Re: Is there anyway Spark UI is set to poll and refreshes itself

2016-08-25 Thread Marek Wiewiorka
Hi, you can take a look at:
https://github.com/hammerlab/spree

It's a bit outdated, but it may still be possible to use it with more
recent Spark versions.

M.

2016-08-25 11:55 GMT+02:00 Mich Talebzadeh :

> Hi,
>
> This may already be there.
>
> A Spark job opens a UI on the port specified by --conf
> "spark.ui.port=${SP}", which defaults to 4040.
>
> However, in the UI one needs to refresh the page to see progress.
>
> Can the UI be polled so that it refreshes automatically?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
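
(For reference, the spark.ui.port setting quoted above can also be set in code
rather than via --conf; a minimal sketch, with the application name and port
value chosen arbitrarily:)

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: pin the UI port programmatically instead of via --conf on
// the command line; "ui-port-example" and 4041 are just example values.
val conf = new SparkConf()
  .setAppName("ui-port-example")
  .set("spark.ui.port", "4041")
val sc = new SparkContext(conf)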


PCA slow in comparison with single-threaded R version

2017-02-06 Thread Marek Wiewiorka
Hi All,
I hit performance issues when running PCA on a matrix with a larger number of
features (2.5k x 15k):

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.mllib.linalg.Vectors

val sampleCnt = 2504
val featureCnt = 15000
val gen = sc.parallelize((1 to sampleCnt).map { r =>
  val rnd = new scala.util.Random()
  Vectors.dense((1 to featureCnt).map(k => rnd.nextInt(2).toDouble).toArray)
})
val rowMat = new RowMatrix(gen)
val pc: Matrix = rowMat.computePrincipalComponents(10)

I'm running the above code on a standalone Spark cluster of 4 nodes with
128 cores in total.
From what I observed, there is a final stage of the algorithm that is
executed on the driver using a single thread, and it seems to be the
bottleneck here - is there any way of tuning it?

It takes ages (I was actually forced to kill it after 30 minutes or so),
whereas the same code written in R executes in ~6.5 minutes on my
laptop (single-threaded):
> a<-replicate(2504, rnorm(5000))
> nrow(a)
[1] 5000
> ncol(a)
[1] 2504
> system.time(b<-prcomp(a))
   user  system elapsed
190.284   0.392 191.150
> a<-replicate(2504, rnorm(15000))
> system.time(b<-prcomp(a))
   user  system elapsed
386.520   0.384 386.933

I've compiled Spark with support for native matrix libraries using the
-Pnetlib-lgpl switch.
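
For reference, this is how I check which BLAS backend MLlib actually picked up
(a sketch; it assumes the com.github.fommil.netlib classes bundled by
-Pnetlib-lgpl are on the classpath):

// Prints the concrete BLAS / LAPACK implementations that netlib-java resolved
// at runtime; if these show the F2j* classes, the native libraries were not
// picked up.
println(com.github.fommil.netlib.BLAS.getInstance().getClass.getName)
println(com.github.fommil.netlib.LAPACK.getInstance().getClass.getName)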

Has anyone experienced such problems with mllib version of PCA?

Thanks,
Marek


---cores option in spark-shell

2014-06-03 Thread Marek Wiewiorka
Hi All,
there is information in the Spark 1.0.0 documentation that there is an option
"--cores" that one can use to set the number of cores that spark-shell uses on
the cluster:

You can also pass an option --cores to control the number of
cores that spark-shell uses on the cluster.

This option does not seem to work for me.
If I run the following command:
./spark-shell --cores 12
I keep getting an error:
bad option: '--cores'

Is there any other way of controlling the total number of cores used by
spark-shell?

Thanks,
Marek
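
(One other route that I believe works is capping the application's total cores
through the spark.cores.max property; a sketch for a standalone application
rather than spark-shell itself, where the master URL and the value 12 are
placeholders:)

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: cap the total number of cores the application may take on a
// standalone (or coarse-grained Mesos) cluster via spark.cores.max.
val conf = new SparkConf()
  .setAppName("cores-capped-app")
  .setMaster("spark://master:7077")   // placeholder master URL
  .set("spark.cores.max", "12")
val sc = new SparkContext(conf)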


Re: ---cores option in spark-shell

2014-06-03 Thread Marek Wiewiorka
That used to work with version 0.9.1 and earlier but does not seem to work
with 1.0.0.
M.




2014-06-03 17:53 GMT+02:00 Mikhail Strebkov :

> Try -c  instead, works for me, e.g.
>
> bin/spark-shell -c 88
>
>
>
> On Tue, Jun 3, 2014 at 8:15 AM, Marek Wiewiorka  > wrote:
>
>> Hi All,
>> there is information in 1.0.0 Spark's documentation that
>> there is an option "--cores" that one can use to set the number of cores
>> that spark-shell uses on the cluster:
>>
>> You can also pass an option --cores  to control the number of
>> cores that spark-shell uses on the cluster.
>>
>> This option does not seem to work for me.
>> If run the following command:
>> ./spark-shell --cores 12
>> I'm keep on getting an error:
>> bad option: '--cores'
>>
>> Is there any other way of controlling the total number of cores used by
>> sparkshell?
>>
>> Thanks,
>> Marek
>>
>>
>


Spark 1.0.0 fails if mesos.coarse set to true

2014-06-03 Thread Marek Wiewiorka
Hi All,
I'm trying to run code that used to work with mesos-0.14 and spark-0.9.0
against mesos-0.18.2 and spark-1.0.0, and I'm getting a weird error when I use
coarse mode (see below).
If I use the fine-grained mode, everything is fine.
Has anybody experienced a similar error?

more stderr
---

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2
I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave
201405220917-134217738-5050-27119-0
Exception in thread "main" java.lang.NumberFormatException: For input
string: "sparkseq003.cloudapp.net"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:492)
at java.lang.Integer.parseInt(Integer.java:527)
at
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135)
at
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)

more stdout
---
Registered executor on sparkseq003.cloudapp.net
Starting task 5
Forked command at 61202
sh -c '"/home/mesos/spark-1.0.0/bin/spark-class"
org.apache.spark.executor.CoarseGrainedExecutorBackend
-Dspark.mesos.coarse=true
akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseGrainedScheduler
201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4'
Command exited with status 1 (pid: 61202)

Many thanks,
Marek


Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-03 Thread Marek Wiewiorka
Hi All,
I've been experiencing a very strange error after upgrading from Spark 0.9 to
1.0 - it seems that the saveAsTextFile function is throwing
a java.lang.UnsupportedOperationException that I have never seen before.
Any hints appreciated.

scheduler.TaskSetManager: Loss was due to java.lang.ClassNotFoundException:
org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 [duplicate 45]
14/06/03 16:46:23 ERROR actor.OneForOneStrategy:
java.lang.UnsupportedOperationException
at
org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
at
org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
at
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
at
org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
at
org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499)
at
org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151)
at
org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147)
at
akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
at
akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
at akka.dispatch.Mailbox.run(Mailbox.scala:218)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thanks,
Marek


Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-03 Thread Marek Wiewiorka
Yes, I have - I compiled both Spark and my software from sources. Actually, the
whole processing executes fine - it is just saving the results that fails.



2014-06-03 21:01 GMT+02:00 Gerard Maas :

> Have you tried re-compiling your job against the 1.0 release?
>
>
> On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka  > wrote:
>
>> Hi All,
>> I've been experiencing a very strange error after upgrade from Spark 0.9
>> to 1.0 - it seems that the saveAsTextFile function is throwing
>> java.lang.UnsupportedOperationException that I have never seen before.
>> Any hints appreciated.
>>
>> scheduler.TaskSetManager: Loss was due to
>> java.lang.ClassNotFoundException:
>> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 [duplicate 45]
>> 14/06/03 16:46:23 ERROR actor.OneForOneStrategy:
>> java.lang.UnsupportedOperationException
>> at
>> org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
>> at
>> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>> at
>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGScheduler.scala:998)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply$mcVI$sp(DAGScheduler.scala:499)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$doCancelAllJobs$1.apply(DAGScheduler.scala:499)
>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>> at
>> org.apache.spark.scheduler.DAGScheduler.doCancelAllJobs(DAGScheduler.scala:499)
>> at
>> org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1151)
>> at
>> org.apache.spark.scheduler.DAGSchedulerActorSupervisor$$anonfun$2.applyOrElse(DAGScheduler.scala:1147)
>> at
>> akka.actor.SupervisorStrategy.handleFailure(FaultHandling.scala:295)
>> at
>> akka.actor.dungeon.FaultHandling$class.handleFailure(FaultHandling.scala:253)
>> at akka.actor.ActorCell.handleFailure(ActorCell.scala:338)
>> at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:423)
>> at akka.actor.ActorCell.systemInvoke(ActorCell.scala:447)
>> at
>> akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:262)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:218)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>> Thanks,
>> Marek
>>
>
>


Re: Spark 1.0.0 fails if mesos.coarse set to true

2014-06-04 Thread Marek Wiewiorka
Exactly the same story - it used to work with 0.9.1 and does not work
anymore with 1.0.0.
I ran tests using spark-shell as well as my application (so I tested turning
coarse mode on both via an env variable and explicitly via SparkContext
properties).

M.


2014-06-04 18:12 GMT+02:00 ajatix :

> I'm running a manually built cluster on EC2. I have mesos (0.18.2) and hdfs
> (2.0.0-cdh4.5.0) installed on all slaves (3) and masters (3). I have
> spark-1.0.0 on one master and the executor file is on hdfs for the slaves.
> Whenever I try to launch a spark application on the cluster, it starts a
> task on each slave (i'm using default configs) and they start FAILING with
> the error msg - 'Is spark installed on it?'
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-0-fails-if-mesos-coarse-set-to-true-tp6817p6945.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Strange problem with saveAsTextFile after upgrade Spark 0.9.0->1.0.0

2014-06-04 Thread Marek Wiewiorka
No, it's a Scala application. Unfortunately, after I came across problems
with running in mesos coarse mode and this issue, I decided to downgrade to
Spark 0.9.1 and purged the logs. But as far as I can remember, when I tried to
run my app using Spark standalone mode, the same ClassNotFoundException was
reported.

M.


2014-06-04 18:23 GMT+02:00 Mark Hamstra :

> Actually, what the stack trace is showing is the result of an exception
> being thrown by the DAGScheduler's event processing actor.  What happens is
> that the Supervisor tries to shut down Spark when an exception is thrown by
> that actor.  As part of the shutdown procedure, the DAGScheduler tries to
> cancel any jobs running on the cluster, but the scheduler backend for Mesos
> doesn't yet implement killTask, so the shutdown procedure fails with an
> UnsupportedOperationException.
>
> In other words, the stack trace is all about failure to cleanly shut down
> in response to some prior failure.  What that prior, root-cause failure
> actually was is not clear to me from the stack trace or bug report, but at
> least the failure to shut down should be fixed in Spark 1.0.1 after PR 686
> <https://github.com/apache/spark/pull/686> is merged.
>
> Was this an application created with the Python API?  There have been some
> similar bug reports associated with Python applications, but I'm not sure
> at this point that the problem actually resides in PySpark.
>
>
> On Wed, Jun 4, 2014 at 8:38 AM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>>
>> On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka <
>> marek.wiewio...@gmail.com> wrote:
>>
>>> Hi All,
>>> I've been experiencing a very strange error after upgrade from Spark 0.9
>>> to 1.0 - it seems that the saveAsTextFile function is throwing
>>> java.lang.UnsupportedOperationException that I have never seen before.
>>>
>>
>> In the stack trace you quoted, saveAsTextFile is not called. Is it really
>> throwing an exception? Do you have the stack trace from the executor
>> process? I think the exception originates from there, and the scheduler is
>> just reporting it here.
>>
>>
>>> Any hints appreciated.
>>>
>>> scheduler.TaskSetManager: Loss was due to
>>> java.lang.ClassNotFoundException:
>>> org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1 [duplicate 45]
>>> 14/06/03 16:46:23 ERROR actor.OneForOneStrategy:
>>> java.lang.UnsupportedOperationException
>>> at
>>> org.apache.spark.scheduler.SchedulerBackend$class.killTask(SchedulerBackend.scala:32)
>>> at
>>> org.apache.spark.scheduler.cluster.mesos.MesosSchedulerBackend.killTask(MesosSchedulerBackend.scala:41)
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply$mcVJ$sp(TaskSchedulerImpl.scala:185)
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3$$anonfun$apply$1.apply(TaskSchedulerImpl.scala:183)
>>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:183)
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$cancelTasks$3.apply(TaskSchedulerImpl.scala:176)
>>> at scala.Option.foreach(Option.scala:236)
>>> at
>>> org.apache.spark.scheduler.TaskSchedulerImpl.cancelTasks(TaskSchedulerImpl.scala:176)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply$mcVI$sp(DAGScheduler.scala:1058)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages$1.apply(DAGScheduler.scala:1045)
>>> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1045)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler.handleJobCancellation(DAGSched

Re: cache spark sql parquet file in memory?

2014-06-07 Thread Marek Wiewiorka
I was also thinking of using Tachyon to store Parquet files - maybe
tomorrow I will give it a try as well.
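
A rough, untested sketch of what I have in mind, assuming Spark 1.0's SchemaRDD
API and a Tachyon filesystem reachable from the cluster (the master address,
path and schema below are just placeholders):

import org.apache.spark.sql.SQLContext

// Untested sketch: write a SchemaRDD out as Parquet to a Tachyon path instead
// of HDFS, then read it back and query it.
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

case class Record(id: Int, value: String)
val records = sc.parallelize(1 to 100).map(i => Record(i, s"val_$i"))

records.saveAsParquetFile("tachyon://tachyon-master:19998/tmp/records.parquet")

val fromTachyon =
  sqlContext.parquetFile("tachyon://tachyon-master:19998/tmp/records.parquet")
fromTachyon.registerAsTable("records")   // Spark 1.0 API name
sqlContext.sql("SELECT COUNT(*) FROM records").collect()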


2014-06-07 20:01 GMT+02:00 Michael Armbrust :

> Not a stupid question!  I would like to be able to do this.  For now, you
> might try writing the data to tachyon 
> instead of HDFS.  This is untested though, please report any issues you run
> into.
>
> Michael
>
>
> On Fri, Jun 6, 2014 at 8:13 PM, Xu (Simon) Chen  wrote:
>
>> This might be a stupid question... but it seems that saveAsParquetFile()
>> writes everything back to HDFS. I am wondering if it is possible to cache
>> parquet-format intermediate results in memory, and therefore making spark
>> sql queries faster.
>>
>> Thanks.
>> -Simon
>>
>
>


Re: Using Spark to crack passwords

2014-06-11 Thread Marek Wiewiorka
What about rainbow tables?
http://en.wikipedia.org/wiki/Rainbow_table

M.


2014-06-12 2:41 GMT+02:00 DB Tsai :

> I think creating the samples in the search space within an RDD will be
> too expensive, and the amount of data will probably be larger than any
> cluster.
>
> However, you could create an RDD of search ranges, and each range
> will be searched by one map operation. As a result, in this design,
> the # of rows in the RDD will be the same as the # of executors, and we can
> use mapPartitions to loop through all the samples in a range without
> actually storing them in the RDD.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Wed, Jun 11, 2014 at 5:24 PM, Nick Chammas
>  wrote:
> > Spark is obviously well-suited to crunching massive amounts of data. How
> > about to crunch massive amounts of numbers?
> >
> > A few years ago I put together a little demo for some co-workers to
> > demonstrate the dangers of using SHA1 to hash and store passwords. Part
> of
> > the demo included a live brute-forcing of hashes to show how SHA1's speed
> > made it unsuitable for hashing passwords.
> >
> > I think it would be cool to redo the demo, but utilize the power of a
> > cluster managed by Spark to crunch through hashes even faster.
> >
> > But how would you do that with Spark (if at all)?
> >
> > I'm guessing you would create an RDD that somehow defined the search
> space
> > you're going to go through, and then partition it to divide the work up
> > equally amongst the cluster's cores. Does that sound right?
> >
> > I wonder if others have already used Spark for computationally-intensive
> > workloads like this, as opposed to just data-intensive ones.
> >
> > Nick
> >
> >
> > 
> > View this message in context: Using Spark to crack passwords
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
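
For what it's worth, a toy sketch of the range-per-partition idea described
above: candidates of a fixed length are generated and hashed inside
mapPartitions so they are never materialised in an RDD. All names, the SHA-1
helper and the parameters are illustrative only, and this is untested at scale.

import java.security.MessageDigest
import org.apache.spark.SparkContext

// Hash a string with SHA-1 and hex-encode the digest.
def sha1Hex(s: String): String =
  MessageDigest.getInstance("SHA-1")
    .digest(s.getBytes("UTF-8"))
    .map("%02x".format(_))
    .mkString

// Toy brute-force: split the candidate space into numRanges chunks, one per
// partition, and generate/hash candidates inside mapPartitions so they are
// never stored in the RDD itself.
def crack(sc: SparkContext, targetHash: String,
          charset: IndexedSeq[Char], length: Int, numRanges: Int): Option[String] = {
  val total = BigInt(charset.size).pow(length).toLong  // candidates of exactly `length` chars
  sc.parallelize(0 until numRanges, numRanges).mapPartitions { ranges =>
    ranges.flatMap { r =>
      val start = total * r / numRanges
      val end   = total * (r + 1) / numRanges
      Iterator.iterate(start)(_ + 1).takeWhile(_ < end)
        .map { i =>
          // decode index i into a candidate string over the charset
          var n = i
          val sb = new StringBuilder
          (1 to length).foreach { _ => sb.append(charset((n % charset.size).toInt)); n /= charset.size }
          sb.toString
        }
        .filter(candidate => sha1Hex(candidate) == targetHash)
    }
  }.take(1).headOption
}

// Example: look for a 4-character lowercase password across the cluster.
// val hit = crack(sc, sha1Hex("spar"), 'a' to 'z', 4, sc.defaultParallelism * 4)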


Re: Using Spark to crack passwords

2014-06-12 Thread Marek Wiewiorka
This is actually what I've already mentioned - with rainbow tables kept in
memory it could be really fast!

Marek


2014-06-12 9:25 GMT+02:00 Michael Cutler :

> Hi Nick,
>
> The great thing about any *unsalted* hashes is you can precompute them
> ahead of time, then it is just a lookup to find the password which matches
> the hash in seconds -- always makes for a more exciting demo than "come
> back in a few hours".
>
> It is a no-brainer to write a generator function to create all possible
> passwords from a charset like "
> abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789", hash
> them and store them to lookup later.  It is however incredibly wasteful on
> storage space.
>
> - all passwords from 1 to 9 letters long
> - using the charset above = 13,759,005,997,841,642 passwords
> - assuming 20 bytes to store the SHA-1 and up to 9 to store the password
>  equals approximately 375.4 Petabytes
>
> Thankfully there is already a more efficient/compact mechanism to achieve
> this using Rainbow Tables  --
> better still, there is an active community of people who have already
> precomputed many of these datasets already.  The above dataset is readily
> available to download and is just 864GB -- much more feasible.
>
> All you need to do then is write a rainbow-table lookup function in Spark
> and leverage the precomputed files stored in HDFS.  Done right you should
> be able to achieve interactive (few second) lookups.
>
> Have fun!
>
> MC
>
>
> Michael Cutler
> Founder, CTO
>
> Mobile: +44 789 990 7847
> Email: mich...@tumra.com
> Web: tumra.com
> Visit us at our offices in Chiswick Park
> Registered in England & Wales, 07916412. VAT No. 130595328
>
>
> This email and any files transmitted with it are confidential and may also
> be privileged. It is intended only for the person to whom it is addressed.
> If you have received this email in error, please inform the sender 
> immediately.
> If you are not the intended recipient you must not use, disclose, copy,
> print, distribute or rely on this email.
>
>
> On 12 June 2014 01:24, Nick Chammas  wrote:
>
>> Spark is obviously well-suited to crunching massive amounts of data. How
>> about to crunch massive amounts of numbers?
>>
>> A few years ago I put together a little demo for some co-workers to
>> demonstrate the dangers of using SHA1
>>  to hash and store
>> passwords. Part of the demo included a live brute-forcing of hashes to show
>> how SHA1's speed made it unsuitable for hashing passwords.
>>
>> I think it would be cool to redo the demo, but utilize the power of a
>> cluster managed by Spark to crunch through hashes even faster.
>>
>> But how would you do that with Spark (if at all)?
>>
>> I'm guessing you would create an RDD that somehow defined the search
>> space you're going to go through, and then partition it to divide the work
>> up equally amongst the cluster's cores. Does that sound right?
>>
>> I wonder if others have already used Spark for computationally-intensive
>> workloads like this, as opposed to just data-intensive ones.
>>
>> Nick
>>
>>
>> --
>> View this message in context: Using Spark to crack passwords
>> 
>> Sent from the Apache Spark User List mailing list archive
>>  at Nabble.com.
>>
>
>


lower&upperBound not working/spark 1.3

2015-03-22 Thread Marek Wiewiorka
Hi All - I'm trying to use the new SQLContext API for populating a DataFrame
from a JDBC data source, like this:

val jdbcDF = sqlContext.jdbc(
  url = "jdbc:postgresql://localhost:5430/dbname?user=user&password=111",
  table = "se_staging.exp_table3",
  columnName = "cs_id",
  lowerBound = 1,
  upperBound = 1,
  numPartitions = 12)

No matter how I set the lower and upper bounds, I always get all the rows from
my table.
The API is marked as experimental, so I assume there might be some bugs in it,
but did anybody come across a similar issue?

Thanks!


Re: lower&upperBound not working/spark 1.3

2015-03-22 Thread Marek Wiewiorka
...I even tried setting the upper/lower bounds to the same value, like 1 or 10,
with the same result.
cs_id is a column with a cardinality of ~5*10^6,
so this is not the case here.

Regards,
Marek

2015-03-22 20:30 GMT+01:00 Ted Yu :

> From javadoc of JDBCRelation#columnPartition():
>* Given a partitioning schematic (a column of integral type, a number of
>* partitions, and upper and lower bounds on the column's value),
> generate
>
> In your example, 1 and 1 are for the value of cs_id column.
>
> Looks like all the values in that column fall within the range of 1 and
> 1000.
>
> Cheers
>
> On Sun, Mar 22, 2015 at 8:44 AM, Marek Wiewiorka <
> marek.wiewio...@gmail.com> wrote:
>
>> Hi All - I try to use the new SQLContext API for populating DataFrame
>> from jdbc data source.
>> like this:
>>
>> val jdbcDF = sqlContext.jdbc(url =
>> "jdbc:postgresql://localhost:5430/dbname?user=user&password=111", table =
>> "se_staging.exp_table3" ,columnName="cs_id",lowerBound=1 ,upperBound =
>> 1, numPartitions=12 )
>>
>> No matter how I set lower and upper bounds I always get all the rows from
>> my table.
>> The API is marked as experimental so I assume there might by some bugs in
>> it but
>> did anybody come across a similar issue?
>>
>> Thanks!
>>
>
>
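
To make the quoted javadoc concrete, here is a simplified illustration (not the
actual Spark source) of how I read JDBCRelation#columnPartition: the bounds only
determine where the partition boundaries fall, and the first and last partitions
are left open-ended, so between them the partitions always cover the whole table.

// Simplified illustration: derive per-partition WHERE clauses from the bounds.
// The first partition has no lower bound and the last has no upper bound, so
// the partitions together still cover every row of the table.
def partitionWhereClauses(column: String, lowerBound: Long, upperBound: Long,
                          numPartitions: Int): Seq[String] = {
  val stride = math.max((upperBound - lowerBound) / numPartitions, 1L)
  (0 until numPartitions).map { i =>
    val lower = if (i == 0) None else Some(s"$column >= ${lowerBound + i * stride}")
    val upper = if (i == numPartitions - 1) None else Some(s"$column < ${lowerBound + (i + 1) * stride}")
    Seq(lower, upper).flatten.mkString(" AND ") match {
      case ""     => "1=1"   // a single partition gets no predicate at all
      case clause => clause
    }
  }
}

// partitionWhereClauses("cs_id", 1, 1, 12) yields twelve clauses whose union is
// the whole table, which is consistent with getting all rows back no matter how
// the bounds are set.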


Re: docker spark 1.1.0 cluster

2014-10-24 Thread Marek Wiewiorka
Hi,
here you can find some info regarding 1.0:
https://github.com/amplab/docker-scripts

Marek

2014-10-24 23:38 GMT+02:00 Josh J :

> Hi,
>
> Is there a dockerfiles available which allow to setup a docker spark 1.1.0
> cluster?
>
> Thanks,
> Josh
>