Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Bertrand Dechoux
Nice! I am especially interested in Bayesian Networks, which are only one of the many models that can be expressed by a factor graph representation. Do you do Bayesian Networks learning at scale (parameters and structure) with latent variables? Are you using publicly available tools for that?

SparkContext#cancelJobGroup: is it safe? Who got burned? Who is alive?

2016-06-14 Thread Bertrand Dechoux
e. 2. Who is or was using *interruptOnCancel*? Did you get burned? Is it still working without any incident? Thanks in advance for any info, feedback and war stories. Bertrand Dechoux

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Bertrand Dechoux
The big question would be which features of Esper you are using. Esper is a CEP solution. I doubt that Spark Streaming can do everything Esper does without any development. Spark (Streaming) is more of a general-purpose platform. http://www.espertech.com/products/esper.php But I would be glad to be

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Bertrand Dechoux
Well, anyone can open an account on apache jira and post a new ticket/enhancement/issue/bug... Bertrand Dechoux On Fri, Jul 25, 2014 at 4:07 PM, Sparky gullo_tho...@bah.com wrote: Thanks for the suggestion. I can confirm that my problem is I have files with zero bytes. It's a known bug

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Bertrand Dechoux
Is there any documentation from Cloudera on how to run Spark apps on CDH Manager deployed Spark? Asking the Cloudera community would be a good idea. http://community.cloudera.com/ In the end, only Cloudera will quickly fix issues with CDH... Bertrand Dechoux On Wed, Jul 23, 2014 at 9:28 AM

Re: Large scale ranked recommendation

2014-07-18 Thread Bertrand Dechoux
And you might want to apply clustering first. It is likely that not every user and not every item is unique. Bertrand Dechoux On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath nick.pentre...@gmail.com wrote: It is very true that making predictions in batch for all 1 million users against the 10k

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Bertrand Dechoux
functions with no side effects (i.e. the only impact is the returned result), then you just need to ignore the results from additional attempts of the same task/operator. Bertrand Dechoux On Tue, Jul 15, 2014 at 9:34 PM, Andrew Ash and...@andrewash.com wrote: Hi Nan, Great digging
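The idea in the reply above (side-effect-free tasks, with results from extra attempts ignored) can be illustrated with a minimal sketch. This is not Spark's scheduler code; the function name `first_results` and the `(task_id, attempt_id, result)` tuple layout are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple


def first_results(attempts: Iterable[Tuple[int, int, str]]) -> Dict[int, str]:
    """Keep only the first completed result for each task id.

    Each item is (task_id, attempt_id, result), in completion order.
    Because the tasks are assumed to be pure functions, any later
    attempt of the same task can simply be discarded.
    """
    accepted: Dict[int, str] = {}
    for task_id, _attempt_id, result in attempts:
        if task_id not in accepted:
            accepted[task_id] = result
    return accepted


# Hypothetical completion log: tasks 0 and 1 each finished twice
# (one original attempt and one speculative attempt).
completed = [(0, 0, "a"), (1, 1, "b"), (1, 0, "b"), (0, 1, "a")]
accepted = first_results(completed)
```

The sketch only works because the tasks have no side effects: a duplicate attempt that wrote to an external system would already have done so before its result was dropped.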

Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Bertrand Dechoux
A patch proposal on the apache JIRA for Spark? https://issues.apache.org/jira/browse/SPARK/ Bertrand On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani rahulbhojwani2...@gmail.com wrote: And also that there is a small bug in implementation. As I mentioned this earlier also. This is my first
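For readers unfamiliar with the thread's subject, Laplace (additive) smoothing can be sketched as follows. This is a generic illustration of the technique, not MLlib's implementation; the function name `smoothed_likelihoods` and the parameter `alpha` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def smoothed_likelihoods(
    tokens: List[str], vocabulary: List[str], alpha: float = 1.0
) -> Dict[str, float]:
    """Estimate P(word | class) with additive smoothing.

    Each count is increased by alpha, so a word never seen in the
    class still gets a non-zero probability. Assumes every token
    belongs to the vocabulary.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / total for word in vocabulary}


# Hypothetical training tokens for one class; "hello" never occurs.
probs = smoothed_likelihoods(["spam", "spam", "offer"], ["spam", "offer", "hello"])
```

With `alpha = 1.0` the unseen word "hello" keeps a small positive probability, which prevents a single missing word from driving the whole class probability to zero.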

Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
of it. Regards Bertrand Dechoux

Re: Shark vs Impala

2014-06-22 Thread Bertrand Dechoux
For the second question, I would say it is mainly because the projects do not have the same aim. Impala does have a cost-based optimizer and predicate propagation capability, which is natural because it interprets pseudo-SQL queries. In the realm of relational databases, it is often not a good idea

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Bertrand Dechoux
I guess you have to understand the difference in architecture. I don't know much about C++ MPI but it is basically MPI whereas Spark is inspired by Hadoop MapReduce and optimised for reading/writing large amounts of data with a smart caching and locality strategy. Intuitively, if you have a high

Re: Hadoop 2.3 Centralized Cache vs RDD

2014-05-16 Thread Bertrand Dechoux
http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html We do not currently cache blocks which are under construction, corrupt, or otherwise incomplete. Have you tried with a file with more than 1 block? And

Re: Real world

2014-05-16 Thread Bertrand Dechoux
http://spark-summit.org ? Bertrand On Thu, May 8, 2014 at 2:05 AM, Ian Ferreira ianferre...@hotmail.com wrote: Folks, I keep getting questioned on real world experience of Spark as in mission critical production deployments. Does anyone have some war stories to share or know of resources

Re: PySpark still reading only text?

2014-04-22 Thread Bertrand Dechoux
Cool, thanks for the link. Bertrand Dechoux On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath nick.pentre...@gmail.com wrote: Also see: https://github.com/apache/spark/pull/455 This will add support for reading sequencefile and other inputformat in PySpark, as long as the Writables are either

PySpark still reading only text?

2014-04-16 Thread Bertrand Dechoux
Hi, I have browsed the online documentation and it states that PySpark can only read text files as sources. Is that still the case? From what I understand, after this first step the RDD can be any serialized Python structure, provided the class definitions are properly distributed. Is it not possible to read

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Bertrand Dechoux
I don't know the Spark issue but the Hadoop context is clear. Old API: org.apache.hadoop.mapred. New API: org.apache.hadoop.mapreduce. You might only need to change your import. Regards Bertrand On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre pbarapa...@gmail.com wrote: Hi,