Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
I wonder if the issue is that these lines just need to add preservesPartitioning = true? https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38 I am getting the feeling this is an issue w/ pyspark. On Thu, Feb 12, 2015 at 10:43 AM, Imran Rashid iras...@cloudera.com wrote: ah,
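For background on why that flag matters: an operation that transforms only the values and never touches the keys cannot move a record out of its hash partition, so Spark can safely keep the existing partitioner. A minimal pure-Python model of that reasoning (no pyspark required; `partition_of` and the partition count are illustrative, not Spark API):

```python
def partition_of(key, num_partitions=4):
    # Model of a hash partitioner: the same key always maps to the same partition.
    return hash(key) % num_partitions

data = [("a", 1), ("b", 2), ("c", 3)]

# A mapValues-style operation transforms only the value, never the key...
mapped = [(k, v * 10) for k, v in data]

# ...so every record stays in the partition it was already in, which is
# exactly the situation preservesPartitioning=true is meant to declare.
assert all(partition_of(k1) == partition_of(k2)
           for (k1, _), (k2, _) in zip(data, mapped))
```

If the keys had been rewritten instead, the assertion could fail and the partitioner would have to be dropped.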

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Ted Yu
You can use JavaPairRDD which has: override def wrapRDD(rdd: RDD[(K, V)]): JavaPairRDD[K, V] = JavaPairRDD.fromRDD(rdd) Cheers On Thu, Feb 12, 2015 at 7:36 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Hi. I am stuck on how to save a file to HDFS from Spark. I have written

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Hi Karlson, I think your assumptions are correct -- that join alone shouldn't require any shuffling. But it's possible you are getting tripped up by lazy evaluation of RDDs. After you do your partitionBy, are you sure those RDDs are actually materialized and cached somewhere? E.g., if you just did
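To illustrate the co-partitioning point: once both datasets have been partitioned by the same partitioner (and materialized), a join can be computed partition by partition with no data movement, because matching keys are guaranteed to land at the same partition index. A toy pure-Python sketch of that invariant (names like `partition_by` and `join_copartitioned` are illustrative, not pyspark API):

```python
def partition_by(pairs, num_partitions):
    # Hash-partition (key, value) pairs, as RDD.partitionBy would.
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash(k) % num_partitions].append((k, v))
    return parts

def join_copartitioned(parts_a, parts_b):
    # Both inputs used the same partitioner, so a key can only ever meet
    # its matches inside the same partition index: join locally, no shuffle.
    out = []
    for pa, pb in zip(parts_a, parts_b):
        lookup = {}
        for k, v in pb:
            lookup.setdefault(k, []).append(v)
        for k, v in pa:
            for w in lookup.get(k, []):
                out.append((k, (v, w)))
    return out

a = partition_by([("x", 1), ("y", 2)], 4)
b = partition_by([("x", 10), ("z", 30)], 4)
result = join_copartitioned(a, b)   # only "x" appears on both sides
```

If either side were re-partitioned lazily (never materialized), Spark could not rely on this invariant and would shuffle.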

Re: saveAsHadoopFile is not a member of ... RDD[(String, MyObject)]

2015-02-12 Thread Imran Rashid
You need to import the implicit conversions to PairRDDFunctions with import org.apache.spark.SparkContext._ (note that this requirement will go away in 1.3: https://issues.apache.org/jira/browse/SPARK-4397) On Thu, Feb 12, 2015 at 9:36 AM, Vladimir Protsenko protsenk...@gmail.com wrote: Hi. I

Re: obtain cluster assignment in K-means

2015-02-12 Thread Shi Yu
Thanks Robin, got it. On Thu, Feb 12, 2015 at 2:21 AM, Robin East robin.e...@xense.co.uk wrote: KMeans.train actually returns a KMeansModel so you can use the predict() method of the model, e.g. clusters.predict(pointToPredict) or clusters.predict(pointsToPredict); the first takes a single Vector,
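For reference, the single-point form of `KMeansModel.predict` just assigns the point to its nearest cluster center. A self-contained sketch of that logic in plain Python (an illustration of the idea, not the MLlib implementation):

```python
def predict(centers, point):
    # Return the index of the closest center, by squared Euclidean distance.
    def sq_dist(c, p):
        return sum((ci - pi) ** 2 for ci, pi in zip(c, p))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i], point))

centers = [(0.0, 0.0), (10.0, 10.0)]   # toy cluster centers
cluster = predict(centers, (9.0, 11.0))  # nearest to the second center
```

The batch form is then just this applied to each vector in the collection.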

Re: Shuffle on joining two RDDs

2015-02-12 Thread Sean Owen
Doesn't this require that both RDDs have the same partitioner? On Thu, Feb 12, 2015 at 3:48 PM, Imran Rashid iras...@cloudera.com wrote: Hi Karlson, I think your assumptions are correct -- that join alone shouldn't require any shuffling. But it's possible you are getting tripped up by lazy

failing GraphX application ('GC overhead limit exceeded', 'Lost executor', 'Connection refused', etc.)

2015-02-12 Thread Matthew Cornell
Hi Folks, I'm running a five-step path-following algorithm on a movie graph with 120K vertices and 400K edges. The graph has vertices for actors, directors, movies, users, and user ratings, and my Scala code is walking the path rating - movie - rating - user - rating. There are 75K rating nodes

Re: [hive context] Unable to query array once saved as parquet

2015-02-12 Thread Ayoub
Hi, as I was trying to find a workaround until this bug is fixed, I discovered another bug, posted here: https://issues.apache.org/jira/browse/SPARK-5775 For those who might have had the same issue, one could use the LOAD SQL command in a hive context to load the parquet file into the table as

Re: Shuffle on joining two RDDs

2015-02-12 Thread Imran Rashid
Ah, sorry, I am not too familiar with pyspark and I missed that part. It could be that pyspark doesn't properly support narrow dependencies, or maybe you need to be more explicit about the partitioner. I am looking into the pyspark api, but you might have some better guesses here than I do.

8080 port password protection

2015-02-12 Thread MASTER_ZION (Jairo Linux)
Hi everyone, I'm creating a development machine in AWS and I would like to protect port 8080 with a password. Is it possible? Best Regards *Jairo Moreno*
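The standalone master web UI on port 8080 has no built-in login, but Spark can install javax servlet filters in front of its web UIs via the `spark.ui.filters` configuration property, so one approach is to supply an authentication filter class on the classpath and point Spark at it. A hedged sketch of the configuration (`com.example.BasicAuthFilter` is a hypothetical class you would have to provide; the other common approach is simply restricting port 8080 in the AWS security group):

```
# spark-defaults.conf (sketch; the filter class is hypothetical)
spark.ui.filters  com.example.BasicAuthFilter
spark.com.example.BasicAuthFilter.params  user=admin,password=secret
```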

Custom Kryo serializer

2015-02-12 Thread Corey Nolet
I'm trying to register a custom class that extends Kryo's Serializer interface. I can't tell exactly what Class the registerKryoClasses() function on the SparkConf is looking for. How do I register the Serializer class?
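As far as I know, `registerKryoClasses()` takes the classes to be serialized (Kryo then uses its default serialization for them), not `Serializer` implementations. To attach a custom Kryo `Serializer`, the usual route is to implement `org.apache.spark.serializer.KryoRegistrator`, whose `registerClasses(kryo)` method can call `kryo.register(...)` with your serializer instance, and point Spark at it via configuration. A sketch of that configuration (`com.example.MyRegistrator` is a placeholder for your registrator class):

```
# spark-defaults.conf (or the equivalent SparkConf.set calls)
spark.serializer        org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator  com.example.MyRegistrator
```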
