Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Christopher Nguyen
Deepak, to be sure, I was referring to sequential guarantees with the longs. I would suggest being careful with taking half the UUID, as the probability of collision can be unexpectedly high. Many bits of the UUID are typically time-based, so collision among those bits is virtually guaranteed with pr
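A rough birthday-bound calculation illustrates the point, even under the optimistic assumption that the 64 retained bits were uniformly random (the vertex count below is only illustrative):

```scala
// Approximate probability of at least one collision among n IDs drawn
// uniformly from a space of 2^bits values (birthday bound). If the retained
// UUID bits are time-based rather than random, the real risk is far higher.
def collisionProbability(n: Long, bits: Int): Double =
  1.0 - math.exp(-0.5 * n.toDouble * (n.toDouble - 1.0) / math.pow(2.0, bits))

// For example, one billion vertices keyed on 64 supposedly random bits:
// collisionProbability(1000000000L, 64) already comes out around 2-3%.
```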

Re: GraphX with UUID vertex IDs instead of Long

2014-02-24 Thread Christopher Nguyen
Deepak, depending on your use case, you might find it appropriate and certainly easy to create a lightweight sequence number service that serves requests from parallel clients. http://stackoverflow.com/questions/2671858/distributed-sequence-number-generation/5685869 There's no shame in using non-Spa
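If a separate service is overkill, a minimal Spark-only sketch (illustrative names and bit split, assuming fewer than 2^23 partitions and 2^40 elements per partition) is to combine the partition index with a per-partition counter:

```scala
import java.util.UUID
import org.apache.spark.rdd.RDD

// Hypothetical input: vertices keyed by UUID, carrying a String payload.
// Each partition gets a disjoint Long range, so no central counter is needed.
def assignLongIds(vertices: RDD[(UUID, String)]): RDD[(Long, (UUID, String))] =
  vertices.mapPartitionsWithIndex { (partId, iter) =>
    iter.zipWithIndex.map { case (v, i) =>
      // partition id in the high bits, per-partition counter in the low bits
      ((partId.toLong << 40) | i.toLong, v)
    }
  }
```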

Re: Connecting an Application to the Cluster

2014-02-17 Thread Christopher Nguyen
David, actually, it's the driver that "creates" and holds a reference to the SparkContext. The master in this context is only a resource manager providing information about the cluster, being aware of where workers are, how many there are, etc. The SparkContext object can get serialized/deserializ

Re: working on a closed network - any recommendations

2014-02-09 Thread Christopher Nguyen
Eran, you could try what Patrick suggested, in detail: 1. Do a full build on a connected laptop, 2. Copy ~/.m2 and ~/.ivy2 over, 3. Run the mvn -o or sbt "set offline:=true" command, if that meets your needs. Sent while mobile. Pls excuse typos etc. On Feb 9, 2014 12:58 PM, "Patrick Wendell" wrote: > I

Re: RDD[URI]

2014-01-30 Thread Christopher Nguyen
Philip, I guess the key problem statement is the "large collection of" part? If so this may be helpful, at the HDFS level: http://blog.cloudera.com/blog/2009/02/the-small-files-problem/. Otherwise you can always start with an RDD[fileUri] and go from there to an RDD[(fileUri, read_contents)]. Sen
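A minimal sketch of that last suggestion; reading with scala.io.Source is only illustrative and assumes paths the workers can open directly, whereas HDFS URIs would go through the Hadoop FileSystem API instead:

```scala
import scala.io.Source
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Go from a list of file URIs to RDD[fileUri], then to RDD[(fileUri, contents)].
def readAll(sc: SparkContext, fileUris: Seq[String]): RDD[(String, String)] =
  sc.parallelize(fileUris)                               // RDD[fileUri]
    .map(uri => (uri, Source.fromFile(uri).mkString))    // RDD[(fileUri, contents)]
```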

Re: Stream RDD to local disk

2014-01-30 Thread Christopher Nguyen
Andrew, couldn't you do this in the Scala code: scala.sys.process.Process("hadoop fs -copyToLocal ...")! Or is that still considered a second step? "hadoop fs" is almost certainly going to be better at copying these files than some memory-to-disk-to-memory serdes within Spark. -- Christopher T. Ng
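For reference, a hedged sketch of that shell-out from Scala, with placeholder paths:

```scala
import scala.sys.process._

// Shell out to "hadoop fs" from the driver to pull results down to local
// disk after the Spark job has written them. Paths are placeholders.
val exitCode = Seq("hadoop", "fs", "-copyToLocal",
                   "hdfs:///path/to/output", "/tmp/local-copy").!
require(exitCode == 0, s"hadoop fs -copyToLocal failed with exit code $exitCode")
```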

Re: Please Help: Amplab Benchmark Performance

2014-01-29 Thread Christopher Nguyen
Shark and Spark since they might be suitable for different > types of tasks. From what you have explained, is it OK to think Shark > is better off for SQL-like tasks, while Spark is more for iterative > machine learning algorithms? > > Cheers, > > -chen > > On Wed, Jan

Re: Please Help: Amplab Benchmark Performance

2014-01-29 Thread Christopher Nguyen
Chen, interesting comparisons you're trying to make. It would be great to share this somewhere when you're done. Some suggestions of non-obvious things to consider: In general there are any number of differences between Shark and some "equivalent" Spark implementation of the same query. Shark is

Re: RDD and Partition

2014-01-28 Thread Christopher Nguyen
PartitionPruningRDD. It says, "A RDD used > to prune RDD partitions/partitions so we can avoid launching tasks on all > partitions". I did not understand this exactly and I couldn't find any > sample code. Can we use this to apply a function only on certain partitions? >

Re: RDD and Partition

2014-01-28 Thread Christopher Nguyen
The result of using > runJob to call toUpperCase on the A-to-M partitions will be the uppercased > strings materialized in the driver process, not a transformation of the > original RDD. > > > On Tue, Jan 28, 2014 at 1:37 PM, Christopher Nguyen wrote: > >> David, >>

Re: RDD and Partition

2014-01-28 Thread Christopher Nguyen
David, map() would iterate row by row, forcing an if on each row. mapPartitions*() allows you to have a conditional on the whole partition first, as Mark suggests. That should usually be sufficient. SparkContext.runJob() allows you to specify which partitions to run on, if you're sure it's neces
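A minimal sketch of both suggestions; the even-partition predicate and partition numbers are placeholders, and the exact runJob overload (for example, whether it takes an allowLocal flag) varies across Spark versions:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// 1) mapPartitions*: make the decision once per partition, not once per row.
def upperCaseSomePartitions(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx % 2 == 0) iter.map(_.toUpperCase) else iter
  }

// 2) runJob: run a function on selected partitions only. The results are
//    materialized in the driver; this is not a transformation of the RDD.
def collectFromPartitions(sc: SparkContext, rdd: RDD[String],
                          partitions: Seq[Int]): Array[Array[String]] =
  sc.runJob(rdd, (iter: Iterator[String]) => iter.toArray, partitions, allowLocal = false)
```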

RE: Can I share the RDD between multiprocess

2014-01-25 Thread Christopher Nguyen
understanding, SparkContext constructor creates an Akka actor system > and starts a jetty UI server. So can we somehow use / tweak the same to > serve to multiple clients? Or can we simply construct a spark context > inside a Java server (like Tomcat) ? > > > > Regards, >

Re: Can I share the RDD between multiprocess

2014-01-24 Thread Christopher Nguyen
D.Y., it depends on what you mean by "multiprocess". RDD lifecycles are currently limited to a single SparkContext. So to "share" RDDs you need to somehow access the same SparkContext. This means one way to share RDDs is to make sure your accessors are in the same JVM that started the SparkContex
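A minimal sketch of the same-JVM approach, assuming a single server process (for example, a Tomcat instance) that owns the SparkContext; the master URL and path are placeholders:

```scala
import org.apache.spark.SparkContext

// One process owns a single SparkContext; every request handler reuses it
// and the RDDs it has cached.
object SharedSpark {
  lazy val sc: SparkContext =
    new SparkContext("spark://master:7077", "shared-rdd-server")

  // Cached once; subsequent requests hit the in-memory copy.
  lazy val sharedRdd = sc.textFile("hdfs:///path/to/data").cache()
}
```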

Re: Forcing RDD computation with something else than count() ?

2014-01-22 Thread Christopher Nguyen
Guillaume, this is RDD.count() /** * Return the number of elements in the RDD. */ def count(): Long = { sc.runJob(this, (iter: Iterator[T]) => { // Use a while loop to count the number of elements rather than iter.size because // iter.size uses a for loop, which is
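For completeness, a lighter-weight way to force evaluation than count(), as a minimal sketch:

```scala
import org.apache.spark.rdd.RDD

// Run an action that discards its input. Every partition is still computed
// (and cached, if the RDD was marked with cache()/persist() beforehand).
def force[T](rdd: RDD[T]): Unit = rdd.foreach(_ => ())
```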

Re: Consistency between RDD's and Native File System

2014-01-17 Thread Christopher Nguyen
user has a single reference (*) (*) yep this is based on DStream/TD's work and will be available soon. -- Christopher T. Nguyen Co-founder & CEO, Adatao <http://adatao.com> linkedin.com/in/ctnguyen On Thu, Jan 16, 2014 at 9:33 PM, Christopher Nguyen wrote: > Mark, that'

Re: Consistency between RDD's and Native File System

2014-01-16 Thread Christopher Nguyen
line, two line, red line, blue line > > 3) Edit silliness.txt so that it is now: > > and now > for something > completely > different > > 4) Continue on with spark-shell: > > scala> println(lines.collect.mkString(", ")) > . > . > . > and no

Re: Consistency between RDD's and Native File System

2014-01-16 Thread Christopher Nguyen
Sai, from your question, I infer that you have an interpretation that RDDs are somehow an in-memory/cached copy of the underlying data source, and so there is an expectation of some synchronization model between the two. That is not what the RDD model is. RDDs are first-class,

Re: WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

2014-01-14 Thread Christopher Nguyen
Aureliano, this sort of jar-hell is something we have to deal with, whether Spark or elsewhere. How would you propose we fix this with Spark? Do you mean that Spark's own scaffolding caused you to pull in both Protobuf 2.4 and 2.5? Or do you mean the error message should have been more helpful? Se

Re: what paper is the L2 regularization based on?

2014-01-08 Thread Christopher Nguyen
Walrus, given the question, this may be a good place for you to start. There's some good discussion there as well as links to papers. http://www.quora.com/Machine-Learning/What-is-the-difference-between-L1-and-L2-regularization Sent while mobile. Pls excuse typos etc. On Jan 8, 2014 2:24 PM, "Wal

Re: Is spark-env.sh supposed to be stateless?

2014-01-03 Thread Christopher Nguyen
How about this: https://github.com/apache/incubator-spark/pull/326 -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Thu, Jan 2, 2014 at 11:07 PM, Matei Zaharia wrote: > I agree that it would be good to do it only once, if you can find a nice > w

Re: Not able to understand Exception.

2014-01-01 Thread Christopher Nguyen
Archit, this occurs in the ResultTask phase, triggered by the call to sortByKey. Prior to this, your RDD would have been serialized for, e.g., shuffling around. So it looks like Kryo wasn't able to deserialize some part of the RDD for some reason, possibly due to formatting incompatibility. Did yo

Re: How to map each line to (line number, line)?

2013-12-31 Thread Christopher Nguyen
It's a reasonable ask (row indices) in some interactive use cases we've come across. We're working on providing support for this at a higher level of abstraction. Sent while mobile. Pls excuse typos etc. On Dec 31, 2013 11:34 AM, "Aureliano Buendia" wrote: > > > > On Mon, Dec 30, 2013 at 8:31 PM
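Until such support lands, one workaround sketch (assuming a stable partition order) is to count each partition, turn the counts into offsets, and add a local index; newer Spark versions expose essentially this as RDD.zipWithIndex:

```scala
import org.apache.spark.rdd.RDD

// Attach a global line number to each line.
def withLineNumbers(lines: RDD[String]): RDD[(Long, String)] = {
  val counts  = lines.mapPartitions(it => Iterator(it.size.toLong)).collect()
  val offsets = counts.scanLeft(0L)(_ + _)   // offsets(i) = number of lines before partition i
  lines.mapPartitionsWithIndex { (i, it) =>
    it.zipWithIndex.map { case (line, j) => (offsets(i) + j, line) }
  }
}
```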

Re: Stateful RDD

2013-12-30 Thread Christopher Nguyen
Bao, to help clarify what TD is saying: Spark launches multiple workers on multiple threads in parallel, running the same closure code in the same JVM on the same machine, but operating on different rows of data. Because of this parallelism, if that worker code weren't thread-safe for some reason,

Re: Stateful RDD

2013-12-27 Thread Christopher Nguyen
Bao, as described, your use case doesn't need to invoke anything like custom RDDs or DStreams. In a call like val resultRdd = scripts.map(s => ScriptEngine.eval(s)) Spark will do its best to serialize/deserialize ScriptEngine to each of the workers---if ScriptEngine is Serializable. Now, if
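If ScriptEngine turns out not to be Serializable, a common workaround (a minimal sketch, assuming the scripts are JavaScript source strings) is to construct the engine on each worker, once per partition, via mapPartitions:

```scala
import javax.script.ScriptEngineManager
import org.apache.spark.rdd.RDD

// Build the engine on the worker side instead of shipping it from the driver.
def evalAll(scripts: RDD[String]): RDD[AnyRef] =
  scripts.mapPartitions { iter =>
    val engine = new ScriptEngineManager().getEngineByName("JavaScript")
    iter.map(s => engine.eval(s))
  }
```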

Re: multi-line elements

2013-12-24 Thread Christopher Nguyen
Phillip, if there are easily detectable line groups you might define your own InputFormat. Alternatively you can consider using mapPartitions() to get access to the entire data partition instead of row-at-a-time. You'd still have to worry about what happens at the partition boundaries. A third appr
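A minimal sketch of the mapPartitions approach, assuming records are delimited by blank lines; records straddling a partition boundary still need the separate handling mentioned above:

```scala
import org.apache.spark.rdd.RDD

// Fold each partition's lines into multi-line records.
def groupRecords(lines: RDD[String]): RDD[Seq[String]] =
  lines.mapPartitions { iter =>
    val (done, last) = iter.foldLeft((Vector.empty[Seq[String]], Vector.empty[String])) {
      case ((records, current), line) =>
        if (line.trim.isEmpty) (records :+ current, Vector.empty[String]) // record boundary
        else (records, current :+ line)
    }
    (done :+ last).filter(_.nonEmpty).iterator
  }
```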

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
ng-compute, as there may be opportunities for parallel speed-ups there. Sent while mobile. Pls excuse typos etc. On Dec 20, 2013 2:56 PM, "Aureliano Buendia" wrote: > > > > On Fri, Dec 20, 2013 at 10:34 PM, Christopher Nguyen wrote: > >> Aureliano, would something

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
A couple of fixes inline. -- Christopher T. Nguyen Co-founder & CEO, Adatao <http://adatao.com> linkedin.com/in/ctnguyen On Fri, Dec 20, 2013 at 2:34 PM, Christopher Nguyen wrote: > Aureliano, would something like this work? The red code is the only place > where you have t

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
On Fri, Dec 20, 2013 at 9:43 PM, Christopher Nguyen wrote: > >> Aureliano, how would your production data be coming in and accessed? It's >> possible that you can still think of that level as a serial operation >> (outer loop, large chunks) first before worrying about paralleliz

Re: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered

2013-12-20 Thread Christopher Nguyen
MichaelY, this sort of thing where "it could be any of dozens of things" can usually be resolved by asking someone to share your screen with you for 5 minutes. It's far more productive than guessing over emails. If @freeman is willing, you can send a private message to him to set that up over Google

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
s a >>> lot of fun, how about this: do a mapPartitions with a threaded subtask for >>> each window. Now you only need to replicate data across the boundaries of >>> each partition of windows, rather than each window. >>> >> >> How can this be written in

Re: How to access a sub matrix in a spark task?

2013-12-20 Thread Christopher Nguyen
Are we over-thinking the problem here? Since the per-window compute task is hugely expensive, stateless from window to window, and the original big matrix is just 1GB, the primary gain in using a parallel engine is in distributing and scheduling these (long-running, isolated) tasks. I'm reading tha

Re: DoubleMatrix vs Array[Array[Double]] : Question about debugging serialization performance issues

2013-12-19 Thread Christopher Nguyen
Guillaume seemed to be able to do this on a per-iteration basis, so it is reasonable to expect that it can be done once. So it's a 50-50 call that it may indeed be something that was "unknowingly changed". Also, are you reading and parsing the data on the slaves, or really serializing it from on

Re: Incremental Updates to an RDD

2013-12-10 Thread Christopher Nguyen
the RDD, push a new value on the head, filter out the > oldest value, and > re-persist as an RDD? > > > > > On Fri, Dec 6, 2013 at 10:13 PM, Christopher Nguyen wrote: > >> Kyle, the fundamental contract of a Spark RDD is that it is immutable. >> This follo

Re: Incremental Updates to an RDD

2013-12-09 Thread Christopher Nguyen
s. If I already held the dataset in the Spark > processes, then I could start calculations immediately. > So is there is a 'better' way to manage a distributed data set, which > would then serve as an input to Spark RDDs? > > Kyle > > > > > On Fri, Dec 6, 2013 a

Re: Writing an RDD to Hive

2013-12-08 Thread Christopher Nguyen
Philip, fwiw we do go with including Shark as a dependency for our needs, making a fat jar, and it works very well. It was quite a bit of pain what with the Hadoop/Hive transitive dependencies, but for us it was worth it. I hope that serves as an existence proof that says Mt Everest has been climb

Re: Incremental Updates to an RDD

2013-12-06 Thread Christopher Nguyen
Kyle, the fundamental contract of a Spark RDD is that it is immutable. This follows the paradigm where data is (functionally) transformed into other data, rather than mutated. This allows these systems to make certain assumptions and guarantees that otherwise they wouldn't be able to. Now we've be
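In that spirit, a minimal sketch of the "push a new value, filter out the oldest, re-persist" idea from the quoted question, expressed as deriving a new RDD rather than mutating the old one (the timestamp layout and cutoff are illustrative):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// "Update" a window by producing a new RDD from the old one.
def slideWindow(sc: SparkContext, window: RDD[(Long, String)],
                newValue: (Long, String), cutoff: Long): RDD[(Long, String)] = {
  val updated = window
    .union(sc.parallelize(Seq(newValue)))       // "push" the new element
    .filter { case (ts, _) => ts >= cutoff }    // drop anything older than the cutoff
  updated.cache()    // re-persist the derived RDD
  window.unpersist() // release the superseded one
  updated
}
```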

Re: Benchmark numbers for terabytes of data

2013-12-04 Thread Christopher Nguyen
Matt, we've done 1TB linear models in 2-3 minutes on 40 node clusters (30GB/node, just enough to hold all partitions simultaneously in memory). You can do with fewer nodes if you're willing to slow things down. Some of our TB benchmark numbers are available in my Spark Summit slides. Sorry I'm on

Re: Using Spark/Shark for massive time series data storage/processing

2013-11-22 Thread Christopher Nguyen
Yizheng, one thing you can do is to use Spark Streaming to ingest the incoming data and store it in HDFS. You can use Shark to impose a schema on this data. Spark/Shark can easily handle 30GB/day. For visualization and analysis, you may want to take a look at Adatao pAnalytics for R, which is buil

Re: Does spark RDD has a partitionedByKey

2013-11-15 Thread Christopher Nguyen
Jiacheng, if you're OK with using the Shark layer above Spark (and I think for many use cases the answer would be "yes"), then you can take advantage of Shark's co-partitioning. Or do something like https://github.com/amplab/shark/pull/100/commits Sent while mobile. Pls excuse typos etc. On Nov 16

Re: DataFrame RDDs

2013-11-15 Thread Christopher Nguyen
Sure, Shay. Let's connect offline. Sent while mobile. Pls excuse typos etc. On Nov 16, 2013 2:27 AM, "Shay Seng" wrote: > Nice, any possibility of sharing this code in advance? > > > On Fri, Nov 15, 2013 at 11:22 AM, Christopher Nguyen wrote: > >> Shay, we've

Re: DataFrame RDDs

2013-11-15 Thread Christopher Nguyen
Shay, we've done this at Adatao, specifically a big data frame in RDD representation and subsetting/projections/data mining/machine learning algorithms on that in-memory table structure. We're planning to harmonize that with the MLBase work in the near future. Just a matter of prioritization on li

Re: Not caching rdds, spark.storage.memoryFraction setting

2013-11-08 Thread Christopher Nguyen
Grega, the way to think about this setting is that it sets the maximum amount of memory Spark is allowed to use for caching RDDs before it must expire or spill them to disk. Spark in principle knows at all times how many RDDs are kept in memory and their total sizes, so it can for example persist t
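A minimal sketch of turning the cache share down when nothing is cached; in Spark versions of this era the setting is read from a system property at SparkContext creation, while newer versions take it through SparkConf (values are illustrative):

```scala
import org.apache.spark.SparkContext

// Shrink the fraction of executor memory reserved for cached RDDs so more
// of the heap is available to task execution.
System.setProperty("spark.storage.memoryFraction", "0.2") // default is 0.6
val sc = new SparkContext("spark://master:7077", "no-cache-app")
```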

Re: almost sorted data

2013-10-28 Thread Christopher Nguyen
Nathan: why iterator semantics could be more appropriate than a materialized list: the partition data could be sitting on disk, which could be streamed into RAM upon access, and which could be left untouched if the algo decided by some conditional logic that it didn't need it. And you could always

Re: Visitor function to RDD elements

2013-10-22 Thread Christopher Nguyen
Ah, this is a slightly different problem statement, in that you may still gain by taking advantage of Spark's parallelization for the data transformation. If you want to avoid the serdes+network overhead of sending the results back to the driver, and instead have a consumer/sink to send the result
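A minimal sketch of that consumer/sink pattern; the Sink trait stands in for whatever consumer you actually have (message queue, database, socket, ...):

```scala
import org.apache.spark.rdd.RDD

// Placeholder for a real downstream consumer.
trait Sink extends Serializable {
  def write(s: String): Unit
  def close(): Unit
}

// Transform in parallel, then push results out from the workers with
// foreachPartition so nothing is shipped back to the driver.
def transformAndSink(input: RDD[String], makeSink: () => Sink): Unit =
  input.map(_.toUpperCase).foreachPartition { iter =>
    val sink = makeSink()          // one connection per partition, opened on the worker
    try iter.foreach(sink.write)
    finally sink.close()
  }
```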

Re: Visitor function to RDD elements

2013-10-22 Thread Christopher Nguyen
On Tue, Oct 22, 2013 at 2:16 PM, Christopher Nguyen wrote: > Matt, it would be useful to back up one level to your problem statement. > If it is strictly restricted as described, then you have a sequential > problem that's not parallelizable. What is the primary design goal here? To > compl

Re: Visitor function to RDD elements

2013-10-22 Thread Christopher Nguyen
Matt, it would be useful to back up one level to your problem statement. If it is strictly restricted as described, then you have a sequential problem that's not parallelizable. What is the primary design goal here? To complete the operation in the shortest time possible ("big compute")? Or to be a

Re: A program that works in local mode but fails in distributed mode

2013-10-16 Thread Christopher Nguyen
Markus, I hear you. Sometimes things just behave the way we don't expect. http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4018748 -- Christopher T. Nguyen Co-founder & CEO, Adatao linkedin.com/in/ctnguyen On Wed, Oct 16, 2013 at 12:30 PM, Markus Losoi wrote: > > Markus,

Re: A program that works in local mode but fails in distributed mode

2013-10-15 Thread Christopher Nguyen
Markus, I'm guessing your config probably has one or more workers running on the same node as your master, and on the same node as your spark-shell, which is why you're expecting the workers to be able to read the same relative path? If that's the case, the reason it doesn't work out as you e

Re: Drawback of Spark memory model as compared to Hadoop?

2013-10-13 Thread Christopher Nguyen
Howard, > - node failure? > - not able to handle if intermediate data > memory size of a node > - cost Spark uses recomputation, aka "journaling" to provide resiliency in case of node failure, thus providing node-level recovery behavior much as Hadoop MapReduce, except much faster recovery in mos

Re: Output to a single directory with multiple files rather multiple directories ?

2013-10-10 Thread Christopher Nguyen
Ramkumar, it sounds like you can consider a file-parallel approach rather than a strict data-parallel parsing of the problem. In other words, separate the file copying task from the file parsing task. Have the driver program D handle the directory scan, which then parallelizes the file list into N