I wonder if the issue is that these lines just need to add
preservesPartitioning = true?
https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38
I am getting the feeling this is an issue w/ pyspark
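(For reference, here is what that flag means on the Scala side -- a minimal
sketch; the names and numbers are mine, not from the thread:)

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("partitioning-sketch").setMaster("local[2]"))

    // Hash-partition a pair RDD so it carries a known partitioner.
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
      .partitionBy(new HashPartitioner(4))

    // Map-side transformations drop the partitioner by default; the flag
    // promises the keys are unchanged, so the partitioner survives.
    val mapped = pairs.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 2) },
      preservesPartitioning = true)

    // Prints Some(...): a later join against an RDD with the same
    // partitioner can then avoid a shuffle.
    println(mapped.partitioner)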
On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid iras...@cloudera.com wrote:
ah,
You can use JavaPairRDD which has:
override def wrapRDD(rdd: RDD[(K, V)]): JavaPairRDD[K, V] =
  JavaPairRDD.fromRDD(rdd)
Cheers
On Thu, Feb 12, 2015 at 7:36 AM, Vladimir Protsenko protsenk...@gmail.com
wrote:
Hi. I am stuck with how to save a file to HDFS from Spark.
I have written
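(The quoted question is cut off, but for reference a minimal save-to-HDFS
sketch looks like this; the path and app name are placeholders:)

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("save-to-hdfs").setMaster("local[2]"))
    val data = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // Each element's toString becomes one line of text output on HDFS.
    data.saveAsTextFile("hdfs://namenode:8020/user/me/output")

    // From the Java API, a Scala RDD[(K, V)] can be wrapped explicitly:
    val jrdd = org.apache.spark.api.java.JavaPairRDD.fromRDD(data)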
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by lazy
evaluation of RDDs. After you do your partitionBy, are you sure those RDDs
are actually materialized and cached somewhere? e.g., if you just did
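(The message is cut off here, but the usual recipe is a sketch like the
following -- names and sizes are mine, not the poster's code:)

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("copartition").setMaster("local[2]"))
    val part = new HashPartitioner(8)

    // Partition both sides the same way and keep the partitioned copies.
    val a = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
    val b = sc.parallelize(Seq((1, 1.0), (2, 2.0))).partitionBy(part).cache()

    // Force materialization; otherwise partitionBy re-runs lazily later.
    a.count(); b.count()

    // Both sides now share a partitioner, so this join needs no shuffle.
    val joined = a.join(b)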
You need to import the implicit conversions to PairRDDFunctions with
import org.apache.spark.SparkContext._
(note that this requirement will go away in 1.3:
https://issues.apache.org/jira/browse/SPARK-4397)
On Thu, Feb 12, 2015 at 9:36 AM, Vladimir Protsenko protsenk...@gmail.com
wrote:
Hi. I
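(The question is truncated, but the import above is easy to demonstrate --
a minimal sketch, with a placeholder output path:)

    import org.apache.spark.{SparkConf, SparkContext}
    // Brings the RDD-to-PairRDDFunctions implicit into scope (pre-1.3).
    import org.apache.spark.SparkContext._

    val sc = new SparkContext(
      new SparkConf().setAppName("pairs").setMaster("local[2]"))

    // reduceByKey lives on PairRDDFunctions, so it only compiles with
    // the implicit conversion imported above.
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map(w => (w, 1))
      .reduceByKey(_ + _)

    counts.saveAsTextFile("hdfs://namenode:8020/user/me/counts")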
Thanks Robin, got it.
On Thu, Feb 12, 2015 at 2:21 AM, Robin East robin.e...@xense.co.uk wrote:
KMeans.train actually returns a KMeansModel, so you can use the predict()
method of the model,
e.g. clusters.predict(pointToPredict)
or
clusters.predict(pointsToPredict)
the first is a single Vector,
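(For reference, a minimal sketch of both predict() overloads, assuming an
existing SparkContext sc; the data and parameters are made up:)

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(9.0, 9.0), Vectors.dense(0.2, 0.1)))

    // KMeans.train returns a KMeansModel.
    val clusters = KMeans.train(data, 2, 20)

    // Single Vector -> cluster index; RDD[Vector] -> RDD[Int].
    val one = clusters.predict(Vectors.dense(8.5, 9.1))
    val many = clusters.predict(data)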
Doesn't this require that both RDDs have the same partitioner?
On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote:
Hi Karlson,
I think your assumptions are correct -- that join alone shouldn't require
any shuffling. But it's possible you are getting tripped up by lazy
Hi Folks,
I'm running a five-step path-following algorithm on a movie graph with 120K
vertices and 400K edges. The graph has vertices for actors, directors, movies,
users, and user ratings, and my Scala code is walking the path rating ->
movie -> rating -> user -> rating. There are 75K rating nodes
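(The message is cut off. For reference, one hop of such a walk can be
written with GraphX's aggregateMessages; the vertex labels and names below
are a hypothetical sketch, not the poster's code:)

    import org.apache.spark.graphx._

    // Vertices carry a label ("rating", "movie", "user", ...); one hop of
    // the walk forwards a count from rating vertices to adjacent movies.
    def oneHop(graph: Graph[String, Int]): VertexRDD[Long] =
      graph.aggregateMessages[Long](
        ctx => if (ctx.srcAttr == "rating" && ctx.dstAttr == "movie")
          ctx.sendToDst(1L),
        _ + _  // merge counts arriving at the same movie vertex
      )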
Hi,
as I was trying to find a workaround until this bug is fixed, I
discovered another bug, posted here:
https://issues.apache.org/jira/browse/SPARK-5775
For those who might have had the same issue, one could use the LOAD sql
command in a hive context to load the parquet file into the table as
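(Truncated; the workaround sketched below uses a placeholder table and
path -- LOAD DATA INPATH is plain HiveQL, issued through a HiveContext:)

    import org.apache.spark.sql.hive.HiveContext

    // assuming an existing SparkContext sc
    val hive = new HiveContext(sc)
    hive.sql(
      "CREATE TABLE IF NOT EXISTS events (id INT, name STRING) STORED AS PARQUET")
    hive.sql("LOAD DATA INPATH '/user/me/events.parquet' INTO TABLE events")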
ah, sorry, I am not too familiar with pyspark and missed that part. It
could be that pyspark doesn't properly support narrow dependencies, or
maybe you need to be more explicit about the partitioner. I am looking
into the pyspark api, but you might have better guesses here than I do.
Hi everyone,
I'm creating a development machine in AWS and I would like to protect
port 8080 with a password.
Is it possible?
Best Regards
Jairo Moreno
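(Spark's standalone web UI has no built-in password, so the common options
are an AWS security group rule or a servlet filter installed through the
spark.ui.filters setting. Below is a hypothetical basic-auth filter sketch;
the class name and credentials are placeholders:)

    import java.util.Base64
    import javax.servlet._
    import javax.servlet.http.{HttpServletRequest, HttpServletResponse}

    // Wire up with: --conf spark.ui.filters=BasicAuthFilter
    class BasicAuthFilter extends Filter {
      // Placeholder credentials; load from configuration in real use.
      private val expected = "Basic " +
        Base64.getEncoder.encodeToString("admin:secret".getBytes("UTF-8"))

      override def init(config: FilterConfig): Unit = ()
      override def destroy(): Unit = ()

      override def doFilter(req: ServletRequest, res: ServletResponse,
                            chain: FilterChain): Unit = {
        val hreq = req.asInstanceOf[HttpServletRequest]
        val hres = res.asInstanceOf[HttpServletResponse]
        if (expected == hreq.getHeader("Authorization")) {
          chain.doFilter(req, res)  // credentials match: pass through
        } else {
          hres.setHeader("WWW-Authenticate", "Basic realm=\"spark\"")
          hres.sendError(HttpServletResponse.SC_UNAUTHORIZED)
        }
      }
    }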
I'm trying to register a custom class that extends Kryo's Serializer
interface. I can't tell exactly what Class the registerKryoClasses()
function on the SparkConf is looking for.
How do I register the Serializer class?
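(For reference: registerKryoClasses takes an Array[Class[_]] and registers
classes with Kryo's default serializer; attaching a custom Serializer goes
through spark.kryo.registrator instead. A sketch with placeholder names:)

    import com.esotericsoftware.kryo.Kryo
    import com.esotericsoftware.kryo.Serializer
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    case class Point(x: Double, y: Double)

    // Hypothetical custom Kryo serializer for Point.
    class PointSerializer extends Serializer[Point] {
      override def write(kryo: Kryo, out: Output, p: Point): Unit = {
        out.writeDouble(p.x); out.writeDouble(p.y)
      }
      override def read(kryo: Kryo, in: Input, t: Class[Point]): Point =
        Point(in.readDouble(), in.readDouble())
    }

    // The registrator hooks the serializer into Spark's Kryo instance.
    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.register(classOf[Point], new PointSerializer)
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "MyRegistrator")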