Fast big data analytics with Spark on Tachyon in Baidu

2015-05-12 Thread Haoyuan Li
Dear all,

We’re organizing a meetup http://www.meetup.com/Tachyon/events/222485713/ on
May 28th at IBM in Foster City that might be of interest to the Spark
community. The focus is a production use case of Spark and Tachyon at Baidu.

You can sign up here: http://www.meetup.com/Tachyon/events/222485713/

Hope some of you can make it!

Best,

Haoyuan


Re: Spark or Tachyon: capture data lineage

2015-01-02 Thread Sven Krasser
Agreed with Jerry. Aside from Tachyon, seeing this for general debugging
would be very helpful.

Haoyuan, is the feature you are referring to related to
https://issues.apache.org/jira/browse/SPARK-975?

In the interim, I've found the toDebugString() method useful. It renders
the execution plan as a tree rather than as a general DAG, though, so it
doesn't always capture the flow in the way I'd like to review it. Example:

 # The comparison operator in the filter was lost in the archive; "<" is assumed.
 a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1] < 1000)
 b = a.join(a)
 print(b.toDebugString())
(16) PythonRDD[19] at RDD at PythonRDD.scala:43
 |   MappedRDD[17] at values at NativeMethodAccessorImpl.java:-2
 |   ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2
 +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
    |   PythonRDD[14] at RDD at PythonRDD.scala:43
    |   UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2
    |   PythonRDD[11] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
    |   PythonRDD[12] at RDD at PythonRDD.scala:43
    |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
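
The tree-versus-DAG point can be illustrated without Spark. The following is only a sketch in plain Python: the DEPS map is a hypothetical dependency graph loosely modelled on the RDD names in the output above (it is not read from Spark), and as_tree / as_dag_edges are illustrative helpers, not Spark APIs. It shows that a tree rendering repeats a shared parent once per path, while an edge list shows it as a single node with two outgoing edges.

```python
# Hypothetical dependency map (node -> parents), loosely modelled on the
# RDD names in the output above. It is an assumption, not data from Spark.
DEPS = {
    "PairwiseRDD[15]": ["PythonRDD[14]"],
    "PythonRDD[14]": ["UnionRDD[13]"],
    "UnionRDD[13]": ["PythonRDD[11]", "PythonRDD[12]"],
    "PythonRDD[11]": ["ParallelCollectionRDD[10]"],
    "PythonRDD[12]": ["ParallelCollectionRDD[10]"],
    "ParallelCollectionRDD[10]": [],
}


def as_tree(node, deps=DEPS, depth=0):
    """Render the graph as an indented tree; shared parents repeat per path."""
    lines = ["  " * depth + node]
    for parent in deps[node]:
        lines.extend(as_tree(parent, deps, depth + 1))
    return lines


def as_dag_edges(root, deps=DEPS):
    """List every (parent, child) edge exactly once, visiting each node once."""
    edges, seen, stack = [], set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for parent in deps[node]:
            edges.append((parent, node))
            stack.append(parent)
    return edges


if __name__ == "__main__":
    print("\n".join(as_tree("PairwiseRDD[15]")))
    for parent, child in as_dag_edges("PairwiseRDD[15]"):
        print(parent, "->", child)
```

In the tree form ParallelCollectionRDD[10] is printed twice; in the edge list it is one node feeding both PythonRDD[11] and PythonRDD[12], which is the flow a DAG view would make visible.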

Best,
-Sven

-- 
http://sites.google.com/site/krasser/?utm_source=sig


Re: Spark or Tachyon: capture data lineage

2015-01-02 Thread Haoyuan Li
Jerry,

Great question. Spark and Tachyon capture lineage information at different
granularities. We are working on an integration between Spark and Tachyon
for this, and hope to release it soon.

Best,

Haoyuan

-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Spark or Tachyon: capture data lineage

2015-01-02 Thread Jerry Lam
Hi Spark developers,

I was thinking it would be nice to extract data lineage information
from a data processing pipeline. I assume that Spark/Tachyon keeps this
information somewhere. For instance, a data processing pipeline uses
data sources A and B to produce C. C is then used by another process to
produce D and E. Assuming A, B, C, D, and E are stored on disk, it would be
very useful to have a way to capture this information while using
Spark/Tachyon, and then to query it. For example: give me the datasets
that produce E. It should return a graph like (A and B) -> C -> E.
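
To make the query concrete, here is a minimal sketch in plain Python. It assumes the lineage has already been recorded somewhere as a map from each dataset to the datasets it was derived from; the PARENTS map and the upstream / lineage_edges helpers are hypothetical names for illustration and are not part of Spark or Tachyon.

```python
from collections import deque

# Hypothetical lineage records: each dataset maps to the datasets it was
# derived from. Mirrors the example: A and B produce C; C produces D and E.
PARENTS = {
    "C": ["A", "B"],
    "D": ["C"],
    "E": ["C"],
}


def upstream(dataset, parents=PARENTS):
    """Return every dataset that `dataset` transitively depends on, nearest first."""
    seen = []
    queue = deque(parents.get(dataset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.append(node)
        queue.extend(parents.get(node, []))
    return seen


def lineage_edges(dataset, parents=PARENTS):
    """Return the (source, derived) edges of the sub-graph that produces `dataset`."""
    edges = []
    for node in [dataset] + upstream(dataset, parents):
        for parent in parents.get(node, []):
            edges.append((parent, node))
    return edges


if __name__ == "__main__":
    print(upstream("E"))       # ['C', 'A', 'B']
    print(lineage_edges("E"))  # [('C', 'E'), ('A', 'C'), ('B', 'C')]
```

The hard part in practice is populating a map like PARENTS across jobs and storage systems; the traversal itself is a plain breadth-first walk.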

Is this something already possible with Spark/Tachyon? If not, do you think
it is feasible? Would anyone mind sharing their experience with capturing
data lineage in a data processing pipeline?

Best Regards,

Jerry


Re: Spark on Tachyon

2014-12-20 Thread Peng Cheng
IMHO: cache doesn't provide redundancy, and it's in the same JVM, so it's
much faster.


