Re: Is there a way to write spark RDD to Avro files

2014-08-02 Thread Fengyun RAO
Below works for me: val job = Job.getInstance val schema = Schema.create(Schema.Type.STRING) AvroJob.setOutputKeySchema(job, schema) records.map(item => (new AvroKey[String](item.getGridsumId), NullWritable.get())) .saveAsNewAPIHadoopFile(args(1),
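
For readers landing here from the archive, a sketch of the full call with the truncated arguments filled in; it assumes `records` is an RDD whose elements expose a `getGridsumId: String` and that `args(1)` is the output path:

    import org.apache.avro.Schema
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._   // pair-RDD functions in pre-1.3 Spark

    val job = Job.getInstance()
    val schema = Schema.create(Schema.Type.STRING)
    AvroJob.setOutputKeySchema(job, schema)

    records
      .map(item => (new AvroKey[String](item.getGridsumId), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        args(1),                              // output path
        classOf[AvroKey[String]],             // key class
        classOf[NullWritable],                // value class
        classOf[AvroKeyOutputFormat[String]], // writes Avro container files
        job.getConfiguration)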

Re: Spark SQL, Parquet and Impala

2014-08-02 Thread Patrick McGloin
Hi Michael, Thanks for your reply. Is this the correct way to load data from Spark into Parquet? Somehow it doesn't feel right. When we followed the steps described for storing the data into Hive tables, everything was smooth: we used HiveContext and the table is automatically recognised by
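
The thread's full code is not shown here; for reference, a minimal sketch (class name, RDD, and warehouse path are placeholders) of writing a SchemaRDD out as Parquet and registering it back for SQL, roughly as things stood around Spark 1.0:

    import org.apache.spark.sql.hive.HiveContext

    case class Event(id: String, value: Int)          // placeholder schema

    val hiveContext = new HiveContext(sc)
    val events = sc.parallelize(Seq(Event("a", 1), Event("b", 2)))

    // turn the case-class RDD into a SchemaRDD and write it out as Parquet
    val schemaRdd = hiveContext.createSchemaRDD(events)
    schemaRdd.saveAsParquetFile("/user/hive/warehouse/events_parquet")

    // read the Parquet files back and expose them to SQL
    val parquetTable = hiveContext.parquetFile("/user/hive/warehouse/events_parquet")
    parquetTable.registerAsTable("events_parquet")     // registerTempTable from 1.1 on
    hiveContext.sql("SELECT * FROM events_parquet").collect().foreach(println)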

Spark ReduceByKey - Working in Java

2014-08-02 Thread Anil Karamchandani
Hi, I am a complete newbie to Spark and MapReduce frameworks and have a basic question on the reduce function. I was working on the word count example and was stuck at the reduce stage where the sum happens. I am trying to understand the working of reduceByKey in Spark using Java as the

Re: Is there a way to write spark RDD to Avro files

2014-08-02 Thread touchdown
YES! This worked! thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-way-to-write-spark-RDD-to-Avro-files-tp10947p11245.html

Re: Spark ReduceByKey - Working in Java

2014-08-02 Thread Sean Owen
I think your questions revolve around the reduce function here, which is a function of 2 arguments returning 1, whereas in a Reducer, you implement a function of many-to-many. This API is simpler if less general. Here you provide an associative operation that can reduce any 2 values down to 1
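
The thread is about the Java API, but the semantics are easiest to see in a short Scala word-count sketch (the input path is a placeholder): reduceByKey keeps applying one associative two-argument function to pairs of values for the same key until a single value per key remains.

    import org.apache.spark.SparkContext._   // pair-RDD functions in pre-1.3 Spark

    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)   // any two partial counts collapse into one
    counts.collect().foreach(println)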

Re: Iterator over RDD in PySpark

2014-08-02 Thread Andrei
Excellent, thank you! On Sat, Aug 2, 2014 at 4:46 AM, Aaron Davidson ilike...@gmail.com wrote: Ah, that's unfortunate, that definitely should be added. Using a pyspark-internal method, you could try something like javaIterator = rdd._jrdd.toLocalIterator() it =
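
For context, the method reached through `rdd._jrdd` here is the JVM-side `RDD.toLocalIterator`; a small Scala sketch of what it does (it pulls one partition at a time to the driver instead of collecting everything at once):

    val rdd = sc.parallelize(1 to 1000, numSlices = 10)
    val it: Iterator[Int] = rdd.toLocalIterator
    it.take(5).foreach(println)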

Re: spark sql

2014-08-02 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please help me resolve the above compilation issue? Regards, Rajesh On Sat, Aug 2, 2014 at 2:02 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, I'm not able to print the values from the Spark SQL JavaSchemaRDD. Please find below my code

RE: spark sql

2014-08-02 Thread N . Venkata Naga Ravi
Hi Rajesh, Can you recheck the version and your code again? I tried similar code below and it works fine (compiles and executes)... // Apply a schema to an RDD of Java Beans and register it as a table. JavaSchemaRDD schemaPeople = sqlCtx.applySchema(people, Person.class);
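
The thread uses the Java API (JavaSchemaRDD); for comparison, the same flow in Scala, including printing the query results on the driver, close to the example in the Spark 1.0 SQL guide:

    case class Person(name: String, age: Int)

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit: RDD[Person] -> SchemaRDD

    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    people.registerAsTable("people")    // registerTempTable from Spark 1.1 on

    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    teenagers.map(t => "Name: " + t(0)).collect().foreach(println)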

Re: spark sql

2014-08-02 Thread Ted Yu
I noticed a misspelling in the compilation error (extra letter 'a'): new Function*a* But in your code the spelling was right. A bit confused. On Fri, Aug 1, 2014 at 1:32 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, I'm not able to print the values from Spark Sql

GraphX

2014-08-02 Thread Deep Pradhan
Hi, I am running Spark on a single-node cluster. I am able to run the examples in Spark, like SparkPageRank.scala and SparkKMeans.scala, with the following command: bin/run-example org.apache.spark.examples.SparkPageRank and the required arguments. Now, I want to run the PageRank.scala that is in GraphX.

Re: GraphX

2014-08-02 Thread Yifan LI
Try this: ./bin/run-example graphx.LiveJournalPageRank edge_list_file… On Aug 2, 2014, at 5:55 PM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, I am running Spark in a single node cluster. I am able to run the codes in Spark like SparkPageRank.scala, SparkKMeans.scala by the following
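
Besides run-example, PageRank can also be invoked directly from the GraphX API in spark-shell; a minimal sketch, with the edge-list path as a placeholder:

    import org.apache.spark.graphx.GraphLoader

    // edge list file with one "srcId dstId" pair per line
    val graph = GraphLoader.edgeListFile(sc, "followers.txt")
    val ranks = graph.pageRank(0.0001).vertices   // tolerance-based PageRank
    ranks.collect().foreach { case (id, rank) => println(s"$id -> $rank") }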

[GraphX] how to compute only a subset of vertices in the whole graph?

2014-08-02 Thread Yifan LI
Hi, I just implemented our algorithm (similar to personalised PageRank) using the Pregel API, and it seems to work well. But I am wondering whether I can compute only some selected vertices (hubs), rather than updating every vertex… is it possible to do this using the Pregel API? Or, more realistically, only

Low Level Kafka Consumer for Spark

2014-08-02 Thread Dibyendu Bhattacharya
Hi, I have implemented a low-level Kafka consumer for Spark Streaming using the Kafka SimpleConsumer API. This API gives better control over Kafka offset management and recovery from failures. As the present Spark KafkaUtils uses the high-level Kafka consumer API, I wanted to have a better
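
For contrast, the existing high-level receiver the message refers to is created through KafkaUtils; a minimal sketch (ZooKeeper address, consumer group, and topic are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    // high-level consumer: ZooKeeper quorum, consumer group, Map(topic -> receiver threads)
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map("events" -> 1))
    stream.map(_._2).count().print()   // message payloads per batch
    ssc.start()
    ssc.awaitTermination()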

Re: GraphX

2014-08-02 Thread Ankur Dave
At 2014-08-02 21:29:33 +0530, Deep Pradhan pradhandeep1...@gmail.com wrote: How should I run graphx codes? At the moment it's a little more complicated to run the GraphX algorithms than the Spark examples due to SPARK-1986 [1]. There is a driver program in org.apache.spark.graphx.lib.Analytics

Re: [GraphX] how to compute only a subset of vertices in the whole graph?

2014-08-02 Thread Ankur Dave
At 2014-08-02 19:04:22 +0200, Yifan LI iamyifa...@gmail.com wrote: But I am thinking of if I can compute only some selected vertexes(hubs), not to do update on every vertex… is it possible to do this using Pregel API? The Pregel API already only runs vprog on vertices that received messages
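
A rough sketch of that behaviour (graph path and hub ids are placeholders): the initial message is delivered to every vertex once, but from the second superstep on, vprog runs only on vertices to which sendMsg actually delivered something, so vertices outside the active region stay untouched.

    import org.apache.spark.graphx._

    val graph = GraphLoader.edgeListFile(sc, "edges.txt")   // placeholder path
    val hubs: Set[VertexId] = Set(1L, 2L)                   // hypothetical hub vertices

    val seeded = graph.mapVertices((id, attr) => if (hubs.contains(id)) 1.0 else 0.0)

    val result = seeded.pregel(initialMsg = 0.0, maxIterations = 10)(
      // vprog: after the first superstep, invoked only where a message arrived
      (id, attr, msg) => attr + msg,
      // sendMsg: vertices with no mass stay silent, so their neighbours stay inactive
      triplet =>
        if (triplet.srcAttr > 0.0) Iterator((triplet.dstId, triplet.srcAttr * 0.5))
        else Iterator.empty,
      // mergeMsg: combine messages bound for the same vertex
      _ + _
    )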

Re: Compiling Spark master (284771ef) with sbt/sbt assembly fails on EC2

2014-08-02 Thread DB Tsai
I ran into this issue as well. The workaround suggested by Shivaram, copying the jar and ivy files manually, works for me. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Fri, Aug 1, 2014 at 3:31

Re: SQLCtx cacheTable

2014-08-02 Thread Michael Armbrust
I am not a mesos expert... but it sounds like there is some mismatch between the size that mesos is giving you and the maximum heap size of the executors (-Xmx). On Fri, Aug 1, 2014 at 12:07 AM, Gurvinder Singh gurvinder.si...@uninett.no wrote: It is not getting out of memory exception. I am
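
Not from the thread, but a small sketch of where the executor heap size comes from on the Spark side, so it can be lined up with what Mesos grants (master URL and size are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("mesos://host:5050")        // placeholder Mesos master
      .setAppName("cache-table-test")
      .set("spark.executor.memory", "4g")    // becomes the executor JVM's -Xmx
    val sc = new SparkContext(conf)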

Re: Spark-sql with Tachyon cache

2014-08-02 Thread Michael Armbrust
We are investigating various ways to integrate with Tachyon. I'll note that you can already use saveAsParquetFile and parquetFile(...).registerAsTable(tableName) (soon to be registerTempTable in Spark 1.1) to store data into tachyon and query it with Spark SQL. On Fri, Aug 1, 2014 at 1:42 AM,
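
A sketch of the route Michael describes, assuming an existing SchemaRDD `schemaRdd`, a SQLContext `sqlContext`, and a Tachyon master on the default port:

    // write the data as Parquet into Tachyon
    schemaRdd.saveAsParquetFile("tachyon://localhost:19998/tables/events")

    // read it back and expose it to Spark SQL
    val cached = sqlContext.parquetFile("tachyon://localhost:19998/tables/events")
    cached.registerAsTable("events")   // registerTempTable from Spark 1.1 on
    println(sqlContext.sql("SELECT * FROM events").count())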

Re: Spark SQL Query Plan optimization

2014-08-02 Thread Michael Armbrust
The number of partitions (which decides the number of tasks) is fixed after any shuffle and can be configured using 'spark.sql.shuffle.partitions' through SQLConf (i.e. sqlContext.set(...) or SET spark.sql.shuffle.partitions=... in SQL). It is possible we will auto-select this based on statistics
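
Both ways of setting it that the reply mentions (the value shown is just an example):

    // programmatically, through SQLConf on the SQLContext
    sqlContext.set("spark.sql.shuffle.partitions", "200")

    // or from SQL
    sqlContext.sql("SET spark.sql.shuffle.partitions=200")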