Deepak, to be sure, I was referring to sequential guarantees with the longs.
I would suggest being careful with taking half the UUID, as the probability
of collision can be unexpectedly high. Many bits of the UUID are typically
time-based, so collision among those bits is virtually guaranteed with
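A side sketch (not from this thread, plain JVM Scala) of the safer alternative: if a 64-bit ID is what's needed, draw it directly from a CSPRNG rather than betting on which half of a UUID carries the random bits.

import java.security.SecureRandom
import java.util.UUID

val rng = new SecureRandom()
val id: Long = rng.nextLong()   // 64 random bits, not tied to a clock or MAC address

// In a version 1 (time-based) UUID, one 64-bit half is dominated by the timestamp
// and the other by the node/clock-sequence, so neither half behaves like 64 random
// bits. UUID.randomUUID() below is version 4 (fully random) for comparison.
val u = UUID.randomUUID()
println((u.getMostSignificantBits, u.getLeastSignificantBits, id))

// Birthday bound: with n random 64-bit IDs, P(collision) is roughly n^2 / 2^65,
// e.g. about 3e-8 for one million IDs.
val n = 1e6
println(n * n / math.pow(2, 65))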
David, actually, it's the driver that creates and holds a reference to
the SparkContext. The master in this context is only a resource manager
providing information about the cluster, being aware of where workers are,
how many there are, etc.
The SparkContext object can get
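A bare-bones sketch of that relationship (the master URL below is hypothetical): the process that constructs the SparkContext is the driver, and the master is only the cluster manager it asks for resources.

import org.apache.spark.{SparkConf, SparkContext}

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // This process is the driver: it creates and holds the SparkContext. The
    // master URL only names the cluster manager to request resources from.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")   // hypothetical standalone master
      .setAppName("driver-holds-the-context")
    val sc = new SparkContext(conf)

    // Jobs are defined and submitted from the driver through this context.
    println(sc.parallelize(1 to 100).count())

    sc.stop()
  }
}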
Eran, you could try what Patrick suggested, in detail:
1. Do a full build on a connected laptop.
2. Copy ~/.m2 and ~/.ivy2 over.
3. Build with mvn -o, or in sbt run set offline := true.
... if that meets your needs.
Sent while mobile. Pls excuse typos etc.
On Feb 9, 2014 12:58 PM, Patrick Wendell
Andrew, couldn't you do this in the Scala code:
scala.sys.process.Process("hadoop fs -copyToLocal ...").!
or is that still considered a second step?
hadoop fs is almost certainly going to be better at copying these files
than some memory-to-disk-to-memory serdes within Spark.
--
Christopher T. Nguyen
Philip, I guess the key problem statement is the large collection of small files
part? If so, this may be helpful, at the HDFS level:
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/.
Otherwise you can always start with an RDD[fileUri] and go from there to an
RDD[(fileUri, read_contents)].
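A minimal sketch (not Christopher's code) of that RDD[(fileUri, contents)] shape using SparkContext.wholeTextFiles, which is available in Spark 1.0 and later; it assumes spark-shell (sc predefined) and a hypothetical glob path:

// One (fileUri, contents) pair per file.
val byFile = sc.wholeTextFiles("hdfs:///data/many-small-files/*")

// From here it is an ordinary RDD, e.g. count lines per file.
val lineCounts = byFile.map { case (uri, contents) => (uri, contents.split("\n").length) }
lineCounts.take(5).foreach(println)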
Sent
for different
types of tasks. From what you have explained, is it fair to think that Shark
is better suited for SQL-like tasks, while Spark is more for iterative
machine learning algorithms?
Cheers,
-chen
On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen c...@adatao.com
wrote:
Chen, interesting
David,
map() would iterate row by row, forcing an if on each row.
mapPartitions*() allows you to have a conditional on the whole partition
first, as Mark suggests. That should usually be sufficient.
SparkContext.runJob() allows you to specify which partitions to run on, if
you're sure it's
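Sketches of the three options, assuming spark-shell (sc predefined) and a hypothetical per-partition condition; note the runJob overload shown is the Spark 2.x one (older versions also take an allowLocal flag):

val rows = sc.parallelize(1 to 1000000, 8)

// map(): the predicate runs once per row.
val perRow = rows.map(r => if (r % 2 == 0) r * 10 else r)

// mapPartitionsWithIndex(): decide once per partition, then stream its iterator.
val perPartition = rows.mapPartitionsWithIndex { (part, iter) =>
  if (part % 2 == 0) iter.map(_ * 10) else iter
}

// runJob(): run only on chosen partitions, here the first two.
val firstTwoSizes = sc.runJob(rows, (iter: Iterator[Int]) => iter.size, Seq(0, 1))
println((perRow.count(), perPartition.count(), firstTwoSizes.toSeq))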
Guillaume, this is RDD.count():

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = {
  sc.runJob(this, (iter: Iterator[T]) => {
    // Use a while loop to count the number of elements rather than iter.size because
    // iter.size uses a for loop, which is
on DStream/TD's work and will be available soon.
--
Christopher T. Nguyen
Co-founder CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Thu, Jan 16, 2014 at 9:33 PM, Christopher Nguyen c...@adatao.com wrote:
Mark, that's precisely why I brought up lineage, in order to say I didn't
want
Sai, from your question, I infer that you have an interpretation that RDDs
are somehow an in-memory/cached copy of the underlying data source, and so
there is an expectation of some synchronization model between the two.
That would not be what the RDD model is. RDDs are first-class,
) Continue on with spark-shell:
scala> println(lines.collect.mkString(", "))
.
.
.
and now, for something, completely, different
On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen c...@adatao.com wrote:
Sai, from your question, I infer that you have an interpretation that
RDDs are somehow
Walrus, given the question, this may be a good place for you to start.
There's some good discussion there as well as links to papers.
http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization
Sent while mobile. Pls excuse typos etc.
On Jan 8, 2014 2:24 PM,
How about this: https://github.com/apache/incubator-spark/pull/326
--
Christopher T. Nguyen
Co-founder CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Thu, Jan 2, 2014 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
I agree that it would be good to do it only once, if you
It's a reasonable ask (row indices) in some interactive use cases we've
come across. We're working on providing support for this at a higher level
of abstraction.
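As a lower-level aside (not the higher-level support mentioned above), Spark 1.0 and later offer RDD.zipWithIndex; a sketch assuming spark-shell and a hypothetical input path:

val rows = sc.textFile("hdfs:///data/table.csv")
val indexed = rows.zipWithIndex().map { case (line, idx) => (idx, line) }   // (rowIndex, row)
indexed.take(3).foreach(println)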
Sent while mobile. Pls excuse typos etc.
On Dec 31, 2013 11:34 AM, Aureliano Buendia buendia...@gmail.com wrote:
On Mon, Dec 30,
Bao, to help clarify what TD is saying: Spark launches multiple workers on
multiple threads in parallel, running the same closure code in the same JVM
on the same machine, but operating on different rows of data.
Because of this parallelism, if that worker code weren't thread-safe for
some
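A compiled-application sketch of that hazard (the per-JVM sharing is clearer outside the shell); SimpleDateFormat stands in here for any non-thread-safe class:

import java.text.SimpleDateFormat
import org.apache.spark.{SparkConf, SparkContext}

object ThreadSafetySketch {
  // One instance per executor JVM; parallel task threads on that executor share it,
  // and SimpleDateFormat is not thread-safe.
  val sharedFmt = new SimpleDateFormat("yyyy-MM-dd")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("thread-safety-sketch"))
    val dates = sc.parallelize(Seq("2014-01-01", "2014-01-02", "2014-01-03"), 3)

    // Risky: every task thread funnels through the same shared instance.
    val risky = dates.map(s => ThreadSafetySketch.sharedFmt.parse(s).getTime)

    // Safer: each task builds its own instance inside mapPartitions.
    val safe = dates.mapPartitions { iter =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      iter.map(s => fmt.parse(s).getTime)
    }

    println((risky.count(), safe.count()))
    sc.stop()
  }
}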
Bao, as described, your use case doesn't need to invoke anything like
custom RDDs or DStreams.
In a call like
val resultRdd = scripts.map(s => ScriptEngine.eval(s))
Spark will do its best to serialize/deserialize ScriptEngine to each of the
workers, if ScriptEngine is Serializable.
Now, if
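If the engine turns out not to be Serializable, one common workaround is to build an engine per partition instead of capturing one in the closure; a sketch assuming spark-shell and that a JavaScript engine ships with the JVM:

import javax.script.ScriptEngineManager

val scripts = sc.parallelize(Seq("1 + 1", "2 * 3", "'a' + 'b'"))
val results = scripts.mapPartitions { iter =>
  val engine = new ScriptEngineManager().getEngineByName("javascript")   // one engine per partition
  iter.map(s => engine.eval(s).toString)
}
results.collect().foreach(println)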
Phillip, if there are easily detectable line groups you might define your
own InputFormat. Alternatively you can consider using mapPartitions() to
get access to the entire data partition instead of row-at-a-time. You'd
still have to worry about what happens at the partition boundaries. A third
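A mapPartitions sketch for the "detectable line groups" case, assuming spark-shell, a hypothetical input path, and a hypothetical "BEGIN" marker that starts each group; groups that straddle a partition boundary are not stitched together, which is exactly the boundary caveat above:

val lines = sc.textFile("hdfs:///data/grouped.txt")
val groups = lines.mapPartitions { iter =>
  // Buffers this partition's groups; fine for a sketch, mind memory for huge partitions.
  val out = scala.collection.mutable.ArrayBuffer.empty[List[String]]
  var current = List.empty[String]
  iter.foreach { line =>
    if (line.startsWith("BEGIN") && current.nonEmpty) { out += current.reverse; current = Nil }
    current = line :: current
  }
  if (current.nonEmpty) out += current.reverse
  out.iterator
}
println(groups.count())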
Are we over-thinking the problem here? Since the per-window compute task is
hugely expensive, stateless from window to window, and the original big
matrix is just 1GB, the primary gain in using a parallel engine is in
distributing and scheduling these (long-running, isolated) tasks. I'm
reading
only need to replicate data across the boundaries of
each partition of windows, rather than each window.
How can this be written in spark scala?
On Fri, Dec 20, 2013 at 2:53 PM, Christopher Nguyen c...@adatao.com wrote:
Are we over-thinking the problem here? Since the per-window compute
MichaelY, this sort of thing, where it could be any of dozens of things,
can usually be resolved by asking someone to share your screen with you for 5
minutes. It's far more productive than guessing over emails.
If @freeman is willing, you can send a private message to him to set that
up over Google
at 9:43 PM, Christopher Nguyen c...@adatao.com wrote:
Aureliano, how would your production data be coming in and accessed? It's
possible that you can still think of that level as a serial operation
(outer loop, large chunks) first before worrying about parallelizing the
computation of the tiny
A couple of fixes inline.
--
Christopher T. Nguyen
Co-founder CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Fri, Dec 20, 2013 at 2:34 PM, Christopher Nguyen c...@adatao.com wrote:
Aureliano, would something like this work? The red code is the only place
where you have to think
, as there may
be opportunities for parallel speed-ups there.
Sent while mobile. Pls excuse typos etc.
On Dec 20, 2013 2:56 PM, Aureliano Buendia buendia...@gmail.com wrote:
On Fri, Dec 20, 2013 at 10:34 PM, Christopher Nguyen c...@adatao.com wrote:
Aureliano, would something like this work
, and
re-persist as an RDD?
On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen c...@adatao.com wrote:
Kyle, the fundamental contract of a Spark RDD is that it is immutable.
This follows the paradigm where data is (functionally) transformed into
other data, rather than mutated. This allows
' way to manage a distributed data set, which
would then serve as an input to Spark RDDs?
Kyle
On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen c...@adatao.com wrote:
Kyle, the fundamental contract of a Spark RDD is that it is immutable.
This follows the paradigm where data
Kyle, the fundamental contract of a Spark RDD is that it is immutable. This
follows the paradigm where data is (functionally) transformed into other
data, rather than mutated. This allows these systems to make certain
assumptions and guarantees that otherwise they wouldn't be able to.
Now we've
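A tiny illustration of transform-rather-than-mutate, assuming spark-shell (sc predefined):

val original = sc.parallelize(1 to 5)
val doubled  = original.map(_ * 2)            // a new RDD; original is untouched
println(original.collect().mkString(", "))    // 1, 2, 3, 4, 5
println(doubled.collect().mkString(", "))     // 2, 4, 6, 8, 10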
Shay, we've done this at Adatao, specifically a big data frame in RDD
representation and subsetting/projections/data mining/machine learning
algorithms on that in-memory table structure.
We're planning to harmonize that with the MLBase work in the near future.
Just a matter of prioritization on
Grega, the way to think about this setting is that it sets the maximum
amount of memory Spark is allowed to use for caching RDDs before it must
expire or spill them to disk. Spark in principle knows at all times how
many RDDs are kept in memory and their total sizes, so it can for example
persist
Matt, it would be useful to back up one level to your problem statement. If
it is strictly restricted as described, then you have a sequential problem
that's not parallelizable. What is the primary design goal here? To
complete the operation in the shortest time possible (big compute)? Or to
be
For better precision,
s/Or to be able to handle very large data sets (big memory)/Or to be able
to hold very large data sets in one place (big memory)/g
--
Christopher T. Nguyen
Co-founder CEO, Adatao http://adatao.com
linkedin.com/in/ctnguyen
On Tue, Oct 22, 2013 at 2:16 PM, Christopher
Ramkumar, it sounds like you can consider a file-parallel approach rather
than a strict data-parallel parsing of the problem. In other words,
separate the file copying task from the file parsing task. Have the driver
program D handle the directory scan, which then parallelizes the file list
into N
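A rough sketch of that file-parallel shape, assuming spark-shell, a hypothetical directory that every worker can also read (e.g. a shared filesystem), and a stand-in parse function:

import java.io.File

def parse(path: String): Int = {
  val src = scala.io.Source.fromFile(path)
  try src.getLines().size finally src.close()
}

val files  = new File("/data/incoming").listFiles().map(_.getAbsolutePath).toSeq  // driver-side scan
val slices = 16                                        // N: how widely to spread the parsing
val parsed = sc.parallelize(files, slices).map(f => (f, parse(f)))
parsed.take(5).foreach(println)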