Fast big data analytics with Spark on Tachyon in Baidu
Dear all, We’re organizing a meetup http://www.meetup.com/Tachyon/events/222485713/ on May 28th at IBM in Foster City that might be of interest to the Spark community. The focus is a production use case of Spark and Tachyon at Baidu. You can sign up here: http://www.meetup.com/Tachyon/events/222485713/ Hope some of you can make it! Best, Haoyuan
Re: Spark or Tachyon: capture data lineage
Agreed with Jerry. Aside from Tachyon, seeing this for general debugging would be very helpful.

Haoyuan, is the feature you are referring to related to https://issues.apache.org/jira/browse/SPARK-975?

In the interim, I've found the toDebugString() method useful (but it renders execution as a tree and not as a more general DAG, and therefore doesn't always capture the flow in the way I'd like to review it). Example:

    # sc is the SparkContext provided by the pyspark shell
    a = sc.parallelize(range(1,1000)).map(lambda x: (x, x*x)).filter(lambda x: x[1] < 1000)
    b = a.join(a)
    print(b.toDebugString())

    (16) PythonRDD[19] at RDD at PythonRDD.scala:43
     |   MappedRDD[17] at values at NativeMethodAccessorImpl.java:-2
     |   ShuffledRDD[16] at partitionBy at NativeMethodAccessorImpl.java:-2
     +-(16) PairwiseRDD[15] at RDD at PythonRDD.scala:261
        |   PythonRDD[14] at RDD at PythonRDD.scala:43
        |   UnionRDD[13] at union at NativeMethodAccessorImpl.java:-2
        |   PythonRDD[11] at RDD at PythonRDD.scala:43
        |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315
        |   PythonRDD[12] at RDD at PythonRDD.scala:43
        |   ParallelCollectionRDD[10] at parallelize at PythonRDD.scala:315

Best,
-Sven

On Fri, Jan 2, 2015 at 12:32 PM, Haoyuan Li haoyuan...@gmail.com wrote:

Jerry, Great question. Spark and Tachyon capture lineage information at different granularities. We are working on an integration between Spark and Tachyon for this. Hope to get it ready to be released soon. Best, Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam chiling...@gmail.com wrote:

Hi Spark developers, I was thinking it would be nice to extract the data lineage information from a data processing pipeline. I assume that Spark/Tachyon keeps this information somewhere. For instance, a data processing pipeline uses datasources A and B to produce C. C is then used by another process to produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so useful if there were a way to capture this information when we are using Spark/Tachyon, so that we could query this data lineage information. For example, give me the datasets that produce E. It should give me a graph like (A and B)-C-E. Is this something already possible with Spark/Tachyon? If not, do you think it is possible? Would anyone mind sharing their experience in capturing the data lineage in a data processing pipeline? Best Regards, Jerry

-- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
-- http://sites.google.com/site/krasser/?utm_source=sig
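The tree-versus-DAG limitation mentioned for toDebugString() can be illustrated without Spark. The sketch below is only an illustration under stated assumptions: the dependency map `deps` and the helpers `tree_lines` and `dag_edges` are hypothetical names, not part of any Spark API. It mimics the shape of the join example, where one source feeds two branches. A tree rendering prints the shared source once per path, while a DAG rendering lists each edge once.

```python
# Hypothetical dependency map shaped like the join example: child -> parents.
# None of these names come from Spark; they only illustrate the idea.
deps = {
    "join": ["union"],
    "union": ["left", "right"],
    "left": ["source"],
    "right": ["source"],
    "source": [],
}


def tree_lines(node, depth=0):
    """Tree rendering: a shared parent is printed once per path that reaches it."""
    lines = ["  " * depth + node]
    for parent in deps[node]:
        lines.extend(tree_lines(parent, depth + 1))
    return lines


def dag_edges(root):
    """DAG rendering: every node is expanded once and every edge listed once."""
    seen, edges, stack = {root}, [], [root]
    while stack:
        node = stack.pop()
        for parent in deps[node]:
            edges.append((parent, node))
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return sorted(edges)


tree = tree_lines("join")
edges = dag_edges("join")
print("\n".join(tree))  # "source" appears under both "left" and "right"
print(edges)            # each (parent, child) edge appears exactly once
```

In the tree output the shared node is duplicated, just as ParallelCollectionRDD[10] shows up twice in the debug string above; the edge list keeps it as a single node with two outgoing edges.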
Re: Spark or Tachyon: capture data lineage
Jerry, Great question. Spark and Tachyon capture lineage information at different granularities. We are working on an integration between Spark and Tachyon for this. Hope to get it ready to be released soon. Best, Haoyuan

On Fri, Jan 2, 2015 at 12:24 PM, Jerry Lam chiling...@gmail.com wrote:

Hi Spark developers, I was thinking it would be nice to extract the data lineage information from a data processing pipeline. I assume that Spark/Tachyon keeps this information somewhere. For instance, a data processing pipeline uses datasources A and B to produce C. C is then used by another process to produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so useful if there were a way to capture this information when we are using Spark/Tachyon, so that we could query this data lineage information. For example, give me the datasets that produce E. It should give me a graph like (A and B)-C-E. Is this something already possible with Spark/Tachyon? If not, do you think it is possible? Would anyone mind sharing their experience in capturing the data lineage in a data processing pipeline? Best Regards, Jerry

-- Haoyuan Li AMPLab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
Spark or Tachyon: capture data lineage
Hi Spark developers, I was thinking it would be nice to extract the data lineage information from a data processing pipeline. I assume that Spark/Tachyon keeps this information somewhere. For instance, a data processing pipeline uses datasources A and B to produce C. C is then used by another process to produce D and E. Assuming A, B, C, D, E are stored on disk, it would be so useful if there were a way to capture this information when we are using Spark/Tachyon, so that we could query this data lineage information. For example, give me the datasets that produce E. It should give me a graph like (A and B)-C-E. Is this something already possible with Spark/Tachyon? If not, do you think it is possible? Would anyone mind sharing their experience in capturing the data lineage in a data processing pipeline? Best Regards, Jerry
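The query described in this message ("give me datasets that produce E") can be sketched in plain Python, independent of Spark or Tachyon. This is a minimal sketch under stated assumptions: `LineageGraph`, `record`, `ancestors`, and `edges_to` are hypothetical names invented for illustration, and the pipeline is assumed to report its inputs and outputs explicitly at each step. Neither Spark nor Tachyon is claimed to expose such an interface.

```python
from collections import defaultdict


class LineageGraph:
    """Hypothetical helper: records which datasets each output was derived from."""

    def __init__(self):
        # Maps a dataset name to the set of datasets it was produced from.
        self.parents = defaultdict(set)

    def record(self, inputs, outputs):
        """Register one pipeline step that reads `inputs` and writes `outputs`."""
        for out in outputs:
            self.parents[out].update(inputs)

    def ancestors(self, dataset):
        """Return every dataset that directly or indirectly produced `dataset`."""
        seen = set()
        stack = [dataset]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def edges_to(self, dataset):
        """Return the (parent, child) edges of the subgraph that leads to `dataset`."""
        nodes = self.ancestors(dataset) | {dataset}
        return sorted((p, c) for c in nodes for p in self.parents.get(c, ()))


# The pipeline from the message: A and B produce C; C produces D and E.
lineage = LineageGraph()
lineage.record(inputs=["A", "B"], outputs=["C"])
lineage.record(inputs=["C"], outputs=["D", "E"])

print(sorted(lineage.ancestors("E")))  # ['A', 'B', 'C']
print(lineage.edges_to("E"))           # [('A', 'C'), ('B', 'C'), ('C', 'E')]
```

The edge list for E corresponds to the graph (A and B)-C-E from the message; D is left out because it does not contribute to E.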
Re: Spark on Tachyon
IMHO: cache doesn't provide redundancy, and it's in the same JVM, so it's much faster.