Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Christopher Nguyen
Deepak, to be sure, I was referring to sequential guarantees with the longs. I would suggest being careful with taking half the UUID as the probability of collision can be unexpectedly high. Many bits of the UUID are typically time-based, so collision among those bits is virtually guaranteed with
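
For illustration, here is a minimal sketch (uuidToVertexId is a hypothetical helper) that mixes both 64-bit halves of the UUID instead of truncating to one half:

    import java.util.UUID

    // Mix both halves: the time-based bits of a single half collide far more often.
    // Collisions remain possible, just much less likely than with a truncated UUID.
    def uuidToVertexId(u: UUID): Long =
      u.getMostSignificantBits ^ java.lang.Long.rotateLeft(u.getLeastSignificantBits, 31)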

Re: Connecting an Application to the Cluster

2014-02-17 Thread Christopher Nguyen
David, actually, it's the driver that creates and holds a reference to the SparkContext. The master in this context is only a resource manager providing information about the cluster, being aware of where workers are, how many there are, etc. The SparkContext object can get
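
A minimal sketch of that relationship, in the constructor style of that era (master URL and app name are placeholders):

    import org.apache.spark.SparkContext

    // The driver program constructs and holds the SparkContext itself; the master
    // URL only tells it which cluster manager to request workers from.
    val sc = new SparkContext("spark://master-host:7077", "MyApp")
    val total = sc.parallelize(1 to 1000).reduce(_ + _)   // executed on the workers
    println(total)                                        // result lives in the driver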

Re: working on a closed network - any recommendations

2014-02-09 Thread Christopher Nguyen
Eran, you could try what Patrick suggested, in detail: 1. Do a full build on a connected laptop. 2. Copy ~/.m2 and ~/.ivy2 over. 3. Build offline with mvn -o, or in sbt run set offline := true, if that meets your needs. Sent while mobile. Pls excuse typos etc. On Feb 9, 2014 12:58 PM, Patrick Wendell

Re: Stream RDD to local disk

2014-01-30 Thread Christopher Nguyen
Andrew, couldn't you do in the Scala code: scala.sys.process.Process("hadoop fs -copyToLocal ...").! or is that still considered a second step? hadoop fs is almost certainly going to be better at copying these files than some memory-to-disk-to-memory serdes within Spark. -- Christopher T.
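
A sketch of that shell-out, with placeholder paths; the Seq form sidesteps shell quoting:

    import scala.sys.process._

    // Invoke the hadoop CLI from the driver once the job has written its output to HDFS.
    val exit = Seq("hadoop", "fs", "-copyToLocal", "/hdfs/output/part-00000", "/local/dir").!
    require(exit == 0, s"copyToLocal failed with exit code $exit")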

Re: RDD[URI]

2014-01-30 Thread Christopher Nguyen
Philip, I guess the key problem statement is the 'large collection of files' part? If so this may be helpful, at the HDFS level: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Otherwise you can always start with an RDD[fileUri] and go from there to an RDD[(fileUri, read_contents)]. Sent
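
A rough sketch of the second suggestion, assuming an existing SparkContext sc, HDFS URIs, and files small enough to read whole (all names and paths are placeholders):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val uris = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")
    val contents = sc.parallelize(uris).map { uri =>
      // Each task opens its own FileSystem handle and reads the whole file.
      val fs = FileSystem.get(new URI(uri), new Configuration())
      val in = fs.open(new Path(uri))
      try { (uri, scala.io.Source.fromInputStream(in).mkString) } finally { in.close() }
    }

(Later Spark releases added SparkContext.wholeTextFiles, which packages up much the same idea.)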

Re: Please Help: Amplab Benchmark Performance

2014-01-29 Thread Christopher Nguyen
for different types of tasks. From what you have explained, is it OK to think Shark is better off for SQL-like tasks, while Spark is more for iterative machine learning algorithms? Cheers, -chen On Wed, Jan 29, 2014 at 8:59 PM, Christopher Nguyen c...@adatao.com wrote: Chen, interesting

Re: RDD and Partition

2014-01-28 Thread Christopher Nguyen
David, map() would iterate row by row, forcing an if on each row. mapPartitions*() allows you to have a conditional on the whole partition first, as Mark suggests. That should usually be sufficient. SparkContext.runJob() allows you to specify which partitions to run on, if you're sure it's
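
A minimal sketch of the per-partition conditional (wantedPartitions and transform are hypothetical):

    val wantedPartitions = Set(0, 2, 5)
    val result = rdd.mapPartitionsWithIndex { (idx, iter) =>
      // One decision per partition rather than one if per row.
      if (wantedPartitions.contains(idx)) iter.map(transform) else Iterator.empty
    }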

Re: Forcing RDD computation with something else than count() ?

2014-01-22 Thread Christopher Nguyen
Guillaume, this is RDD.count(): /** * Return the number of elements in the RDD. */ def count(): Long = { sc.runJob(this, (iter: Iterator[T]) => { // Use a while loop to count the number of elements rather than iter.size because // iter.size uses a for loop, which is
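
For the original question, a sketch of one common alternative that also touches every element but avoids shipping anything back to the driver:

    // A no-op foreach forces evaluation of every partition; with cache() it also
    // materializes the RDD in the same pass.
    rdd.cache()
    rdd.foreach(_ => ())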

Re: Consistency between RDD's and Native File System

2014-01-17 Thread Christopher Nguyen
on DStream/TD's work and will be available soon. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Thu, Jan 16, 2014 at 9:33 PM, Christopher Nguyen c...@adatao.com wrote: Mark, that's precisely why I brought up lineage, in order to say I didn't want

Re: Consistency between RDD's and Native File System

2014-01-16 Thread Christopher Nguyen
Sai, from your question, I infer that you have an interpretation that RDDs are somehow an in-memory/cached copy of the underlying data source---and so there is some expectation that there is some synchronization model between the two. That would not be what the RDD model is. RDDs are first-class,
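
A small illustration of that point, assuming a cached RDD over an HDFS file (path is a placeholder):

    val lines = sc.textFile("hdfs:///data/input.txt").cache()
    val before = lines.count()   // reads the file and materializes the cache
    // ... the file is later modified on HDFS ...
    val after = lines.count()    // served from the cached partitions (unless evicted);
                                 // nothing synchronizes the RDD with the changed file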

Re: Consistency between RDD's and Native File System

2014-01-16 Thread Christopher Nguyen
) Continue on with spark-shell: scala> println(lines.collect.mkString(", ")) . . . and now, for something, completely, different On Thu, Jan 16, 2014 at 7:53 PM, Christopher Nguyen c...@adatao.com wrote: Sai, from your question, I infer that you have an interpretation that RDDs are somehow

Re: what paper is the L2 regularization based on?

2014-01-08 Thread Christopher Nguyen
Walrus, given the question, this may be a good place for you to start. There's some good discussion there as well as links to papers. http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization Sent while mobile. Pls excuse typos etc. On Jan 8, 2014 2:24 PM,
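
For quick reference (a textbook formulation, not tied to any particular paper), the L2-regularized objective adds a squared-norm penalty to the loss, whereas L1 adds \lambda \lVert w \rVert_1 and tends to produce sparse weights:

    \min_{w} \; \sum_{i=1}^{n} L\big(y_i, w^{\top} x_i\big) + \frac{\lambda}{2} \lVert w \rVert_2^2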

Re: Is spark-env.sh supposed to be stateless?

2014-01-03 Thread Christopher Nguyen
How about this: https://github.com/apache/incubator-spark/pull/326 -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Thu, Jan 2, 2014 at 11:07 PM, Matei Zaharia matei.zaha...@gmail.com wrote: I agree that it would be good to do it only once, if you

Re: How to map each line to (line number, line)?

2013-12-31 Thread Christopher Nguyen
It's a reasonable ask (row indices) in some interactive use cases we've come across. We're working on providing support for this at a higher level of abstraction. Sent while mobile. Pls excuse typos etc. On Dec 31, 2013 11:34 AM, Aureliano Buendia buendia...@gmail.com wrote: On Mon, Dec 30,
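
Meanwhile, a sketch of one way to do it at the RDD level in two passes (variable names are made up):

    // Pass 1: count the elements in each partition, then turn the counts into
    // cumulative starting offsets.
    val counts = lines.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size.toLong)) }
                      .collect().sortBy(_._1).map(_._2)
    val offsets = counts.scanLeft(0L)(_ + _)
    val offsetsBc = sc.broadcast(offsets)

    // Pass 2: number each line within its partition, shifted by the partition's offset.
    val numbered = lines.mapPartitionsWithIndex { (i, it) =>
      it.zipWithIndex.map { case (line, j) => (offsetsBc.value(i) + j, line) }
    }

(Later Spark releases added RDD.zipWithIndex, which wraps up the same two-pass idea.)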

Re: Stateful RDD

2013-12-30 Thread Christopher Nguyen
Bao, to help clarify what TD is saying: Spark launches multiple workers on multiple threads in parallel, running the same closure code in the same JVM on the same machine, but operating on different rows of data. Because of this parallelism, if that worker code weren't thread-safe for some
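
A concrete illustration, assuming an RDD[String] of dates called dates: SimpleDateFormat is a classic non-thread-safe class, so one shared instance can silently corrupt results when several worker threads in the same JVM hit it at once.

    import java.text.SimpleDateFormat

    // Risky: a single formatter shared by every task thread in this JVM.
    object Shared { val fmt = new SimpleDateFormat("yyyy-MM-dd") }

    // Safer: one instance per task via mapPartitions.
    val parsed = dates.mapPartitions { it =>
      val fmt = new SimpleDateFormat("yyyy-MM-dd")
      it.map(s => fmt.parse(s).getTime)
    }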

Re: Stateful RDD

2013-12-27 Thread Christopher Nguyen
Bao, as described, your use case doesn't need to invoke anything like custom RDDs or DStreams. In a call like val resultRdd = scripts.map(s => ScriptEngine.eval(s)) Spark will do its best to serialize/deserialize ScriptEngine to each of the workers---if ScriptEngine is Serializable. Now, if
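
If ScriptEngine turns out not to be Serializable, one common workaround (a sketch using the javax.script API; the engine name and its thread-safety depend on your JVM) is to build the engine on the workers instead of shipping it from the driver:

    import javax.script.ScriptEngineManager

    val resultRdd = scripts.mapPartitions { it =>
      // Constructed inside the task, so nothing has to be serialized from the driver.
      val engine = new ScriptEngineManager().getEngineByName("javascript")
      it.map(s => String.valueOf(engine.eval(s)))
    }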

Re: multi-line elements

2013-12-24 Thread Christopher Nguyen
Phillip, if there are easily detectable line groups you might define your own InputFormat. Alternatively you can consider using mapPartitions() to get access to the entire data partition instead of row-at-a-time. You'd still have to worry about what happens at the partition boundaries. A third
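
As a rough sketch of the mapPartitions() route, assuming each record begins with a line starting with a hypothetical BEGIN marker (records straddling a partition boundary still need the extra handling mentioned above):

    val records = lines.mapPartitions { it =>
      val buf = it.buffered
      new Iterator[Seq[String]] {
        def hasNext: Boolean = buf.hasNext
        def next(): Seq[String] = {
          // Take the header line, then everything up to the next header.
          val group = scala.collection.mutable.ArrayBuffer(buf.next())
          while (buf.hasNext && !buf.head.startsWith("BEGIN")) group += buf.next()
          group.toSeq
        }
      }
    }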

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
Are we over-thinking the problem here? Since the per-window compute task is hugely expensive, stateless from window to window, and the original big matrix is just 1GB, the primary gain in using a parallel engine is in distributing and scheduling these (long-running, isolated) tasks. I'm reading

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
only need to replicate data across the boundaries of each partition of windows, rather than each window. How can this be written in spark scala? On Fri, Dec 20, 2013 at 2:53 PM, Christopher Nguyen c...@adatao.com wrote: Are we over-thinking the problem here? Since the per-window compute

Re: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered

2013-12-20 Thread Christopher Nguyen
MichaelY, this sort of thing where it could be any of dozens of things can usually be resolved by asking someone to share your screen with you for 5 minutes. It's far more productive than guessing over emails. If @freeman is willing, you can send a private message to him to set that up over Google

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
at 9:43 PM, Christopher Nguyen c...@adatao.comwrote: Aureliano, how would your production data be coming in and accessed? It's possible that you can still think of that level as a serial operation (outer loop, large chunks) first before worrying about parallelizing the computation of the tiny

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
A couple of fixes inline. -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Fri, Dec 20, 2013 at 2:34 PM, Christopher Nguyen c...@adatao.com wrote: Aureliano, would something like this work? The red code is the only place where you have to think

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
, as there may be opportunities for parallel speed-ups there. Sent while mobile. Pls excuse typos etc. On Dec 20, 2013 2:56 PM, Aureliano Buendia buendia...@gmail.com wrote: On Fri, Dec 20, 2013 at 10:34 PM, Christopher Nguyen c...@adatao.com wrote: Aureliano, would something like this work

Re: Incremental Updates to an RDD

2013-12-10 Thread Christopher Nguyen
, and re-persist as an RDD? On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen c...@adatao.com wrote: Kyle, the fundamental contract of a Spark RDD is that it is immutable. This follows the paradigm where data is (functionally) transformed into other data, rather than mutated. This allows

Re: Incremental Updates to an RDD

2013-12-09 Thread Christopher Nguyen
' way to manage a distributed data set, which would then serve as an input to Spark RDDs? Kyle On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen c...@adatao.com wrote: Kyle, the fundamental contract of a Spark RDD is that it is immutable. This follows the paradigm where data

Re: Incremental Updates to an RDD

2013-12-06 Thread Christopher Nguyen
Kyle, the fundamental contract of a Spark RDD is that it is immutable. This follows the paradigm where data is (functionally) transformed into other data, rather than mutated. This allows these systems to make certain assumptions and guarantees that otherwise they wouldn't be able to. Now we've
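
A tiny illustration of expressing an "update" as a transformation rather than a mutation (keys and values are made up):

    import org.apache.spark.SparkContext._   // pair-RDD functions in older Spark versions

    val base    = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val deltas  = sc.parallelize(Seq(("b", 10)))
    // "Updating" produces a new RDD; base itself is never modified.
    val updated = base.leftOuterJoin(deltas)
                      .mapValues { case (old, maybeNew) => maybeNew.getOrElse(old) }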

Re: DataFrame RDDs

2013-11-15 Thread Christopher Nguyen
Shay, we've done this at Adatao, specifically a big data frame in RDD representation and subsetting/projections/data mining/machine learning algorithms on that in-memory table structure. We're planning to harmonize that with the MLBase work in the near future. Just a matter of prioritization on

Re: Not caching rdds, spark.storage.memoryFraction setting

2013-11-08 Thread Christopher Nguyen
Grega, the way to think about this setting is that it sets the maximum amount of memory Spark is allowed to use for caching RDDs before it must expire or spill them to disk. Spark in principle knows at all times how many RDDs are kept in memory and their total sizes, so it can for example persist
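
A sketch in the configuration style of that era (the value is only an example; size it to how much you actually cache):

    // For a job that never caches RDDs, shrink the cache region and leave the rest
    // of the heap to task execution and shuffles. Set before creating the SparkContext.
    System.setProperty("spark.storage.memoryFraction", "0.1")
    val sc = new org.apache.spark.SparkContext("local[4]", "no-cache-job")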

Re: Visitor function to RDD elements

2013-10-22 Thread Christopher Nguyen
Matt, it would be useful to back up one level to your problem statement. If it is strictly restricted as described, then you have a sequential problem that's not parallelizable. What is the primary design goal here? To complete the operation in the shortest time possible (big compute)? Or to be

Re: Visitor function to RDD elements

2013-10-22 Thread Christopher Nguyen
For better precision, s/Or to be able to handle very large data sets (big memory)/Or to be able to hold very large data sets in one place (big memory)/g -- Christopher T. Nguyen Co-founder CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Tue, Oct 22, 2013 at 2:16 PM, Christopher

Re: Output to a single directory with multiple files rather than multiple directories?

2013-10-10 Thread Christopher Nguyen
Ramkumar, it sounds like you can consider a file-parallel approach rather than a strict data-parallel parsing of the problem. In other words, separate the file copying task from the file parsing task. Have the driver program D handle the directory scan, which then parallelizes the file list into N
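
A sketch of that shape (parseFile, the directory, and the partition count N are all placeholders):

    // The driver scans the directory, then parallelizes the file list into N tasks.
    val files = new java.io.File("/data/incoming").listFiles.map(_.getPath).toSeq
    val N = 64
    val parsed = sc.parallelize(files, N).flatMap(path => parseFile(path))
    // One output directory containing multiple part-files, not one directory per input file.
    parsed.saveAsTextFile("hdfs:///data/output")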