Re: [ML] [GraphFrames] : Bayesian Network framework

2016-12-30 Thread Felix Cheung
GraphFrames has a Belief Propagation example. Have you checked it out? graphframes.github.io/api/scala/index.html#org.graphframes.examples.BeliefPropagation$ From:

Re: ml word2vec finSynonyms return type

2016-12-30 Thread Felix Cheung
Could you link to the JIRA here? What you suggest makes sense to me, though we might want to maintain compatibility and add a new method instead of changing the return type of the existing one. From: Asher Krim Sent:

Re: RDD Location

2016-12-30 Thread Fei Hu
It would be much appreciated if you could give more details about why the runJob function cannot be called in getPreferredLocations(). The NewHadoopRDD and HadoopRDD classes get the location information from the inputSplit. But there may be an issue in NewHadoopRDD, because it generates

Re: RDD Location

2016-12-30 Thread Sun Rui
You can’t call runJob inside getPreferredLocations(). You can take a look at the source code of HadoopRDD to help you implement getPreferredLocations() appropriately. > On Dec 31, 2016, at 09:48, Fei Hu wrote: > > That is a good idea. > > I tried adding the following code to
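A minimal sketch of the pattern suggested above, assuming the location information can be gathered on the driver before the custom RDD is constructed. Each partition carries its own host list, much as HadoopRDD derives locations from its input splits, so getPreferredLocations() becomes a plain lookup and never submits a job. The names LocatedPartition and LocatedRDD are hypothetical and are not part of Spark or of the code in this thread. A likely reason runJob hangs there is that the scheduler calls getPreferredLocations() from its own event loop, so a job submitted inside it would wait on the thread that has to schedule it.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition type: stores the preferred hosts alongside the index.
class LocatedPartition(val index: Int, val hosts: Seq[String]) extends Partition

// Hypothetical RDD: the host lists are computed on the driver beforehand
// and passed in, so no job has to run while the scheduler asks for locations.
class LocatedRDD(sc: SparkContext, hostsPerPartition: Seq[Seq[String]])
  extends RDD[Int](sc, Nil) {

  // One partition per host list.
  override protected def getPartitions: Array[Partition] =
    Array.tabulate[Partition](hostsPerPartition.length) { i =>
      new LocatedPartition(i, hostsPerPartition(i))
    }

  // Pure lookup on the partition object; no runJob call here.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    split.asInstanceOf[LocatedPartition].hosts

  // Placeholder computation for illustration: emit the partition index.
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}
```

In the scenario from this thread, the query against the other RDD would run once on the driver, before the custom RDD is created, and its result would be passed in as hostsPerPartition.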

context.runJob() was suspended in getPreferredLocations() function

2016-12-30 Thread Fei Hu
Dear all, I tried to customize my own RDD. In the getPreferredLocations() function, I used the following code to query another RDD, which was used as an input to initialize this customized RDD: val results: Array[Array[DataChunkPartition]] = context.runJob(partitionsRDD,

Re: RDD Location

2016-12-30 Thread Fei Hu
That is a good idea. I tried adding the following code to the getPreferredLocations() function: val results: Array[Array[DataChunkPartition]] = context.runJob( partitionsRDD, (context: TaskContext, partIter: Iterator[DataChunkPartition]) => partIter.toArray, dd, allowLocal = true) But it

[ML] [GraphFrames] : Bayesian Network framework

2016-12-30 Thread Brian Cajes
Hi, I'm interested in using (or contributing to an implementation of) a Bayesian Network framework within Spark, similar to https://github.com/jmschrei/pomegranate/blob/master/examples/bayesnet_monty_hall_train.ipynb . I've found a related library for Spark:

Re: repeated unioning of dataframes take worse than O(N^2) time

2016-12-30 Thread Liang-Chi Hsieh
Actually, since you use Dataset's union API, which unlike RDD's union API breaks up the nested structure, that should not be the issue. The additional time incurred as the number of DataFrames grows is spent in the analysis stage. I think that because the Union has a long children list, the