Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Ted Yu
Can you show us the function you passed to reduceByKey()? What release of Spark are you using? Cheers On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke lovejay-lovemu...@outlook.com wrote: I'm trying to solve a Word-Count-like problem; the difference lies in that I need the count of a specific

spark with kafka

2015-04-18 Thread Shushant Arora
Hi, I want to consume messages from a Kafka queue using a Spark batch program, not Spark Streaming. Is there any way to achieve this, other than using the low-level (simple API) Kafka consumer? Thanks

MLlib -Collaborative Filtering

2015-04-18 Thread riginos
Is there any way that I can see the similarity table of 2 users in that algorithm? By that I mean the similarity between 2 users. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22553.html Sent from the Apache Spark User List

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Use KafkaRDD directly. It is in the spark-streaming-kafka package. On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora shushantaror...@gmail.com wrote: Hi, I want to consume messages from a Kafka queue using a Spark batch program, not Spark Streaming. Is there any way to achieve this, other than using low
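
A minimal sketch of the suggested batch approach, assuming Spark 1.3.x with the spark-streaming-kafka artifact on the classpath and a spark-shell where sc is predefined; the broker address, topic name, and offset range are placeholders, and the application is responsible for choosing and persisting offsets between runs:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// Placeholder broker and topic; read partition 0 from offset 0 (inclusive) to 1000 (exclusive).
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 1000L))

// Returns a plain batch RDD[(key, value)] backed by the Kafka simple consumer API.
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)
```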

RE: spark with kafka

2015-04-18 Thread Ganelin, Ilya
Write the Kafka stream to HDFS via Spark Streaming, then ingest the files from HDFS via Spark. Sent with Good (www.good.com) -Original Message- From: Shushant Arora [shushantaror...@gmail.com] Sent: Saturday, April 18, 2015 06:44 AM Eastern Standard Time To:

Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
I'm trying to solve a Word-Count-like problem; the difference lies in that I need the count of a specific word within a specific timespan in a social message stream. My data is in the format (time, message), and I transformed it (flatMap etc.) into a series of (time, word_id); the time is
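
A minimal sketch of counting per (time, word) with a composite key and reduceByKey, written in Scala for illustration (the thread itself uses PySpark); the timestamps and word ids are made-up values and sc is assumed to be a predefined spark-shell context:

```scala
// (timestamp, wordId) events, e.g. already flattened from (time, message) via flatMap.
val events = sc.parallelize(Seq((1254798000L, 7), (1254798000L, 7), (1254801600L, 3)))

// Composite key (timestamp, wordId) -> count of that word within that time bucket.
val counts = events
  .map { case (ts, wordId) => ((ts, wordId), 1) }
  .reduceByKey(_ + _)

counts.collect().foreach(println)  // ((1254798000,7),2) and ((1254801600,3),1)
```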

Number of input partitions in SparkContext.sequenceFile

2015-04-18 Thread Wenlei Xie
Hi, I am wondering about the mechanism that determines the number of partitions created by SparkContext.sequenceFile. For example, although my file has only 4 splits, Spark creates 16 partitions for it. Is it determined by the file size? Is there any way to control it? (Looks like I can only
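
For reference, sequenceFile takes an optional minPartitions argument that is passed to the Hadoop InputFormat as a hint for the minimum number of splits; a minimal sketch with a hypothetical path and key/value types, assuming a spark-shell where sc is predefined:

```scala
// minPartitions is a lower-bound hint; the actual partition count still depends on
// the file's block/split layout as seen by the InputFormat.
val rdd = sc.sequenceFile[String, Int]("hdfs:///data/example.seq", minPartitions = 4)
println(rdd.partitions.length)
```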

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Sean Owen
Do these datetime objects implement the notion of equality you'd expect? (This may be a dumb question; I'm thinking of the equivalent of equals() / hashCode() from the Java world.) On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke lovejay-lovemu...@outlook.com wrote: I'm trying to solve a

Re: Can't get SparkListener to work

2015-04-18 Thread Praveen Balaji
Thanks for the response, Archit. I get callbacks when I do not throw an exception from map. My use case, however, is to get callbacks for exceptions in transformations on executors. Do you think I'm going down the right route? Cheers -p On Sat, Apr 18, 2015 at 1:49 AM, Archit Thakur
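
One possible way to observe executor-side exceptions from the driver is a SparkListener that inspects the task-end reason; a hedged sketch for Spark 1.x (SparkListener and addSparkListener are developer APIs, so details may vary by release), assuming a spark-shell where sc is predefined:

```scala
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

sc.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = taskEnd.reason match {
    case Success => ()                                  // task finished normally
    case failure => println(s"task failed: $failure")   // e.g. an ExceptionFailure thrown inside map()
  }
})
```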

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
The comparison of Python tuples is lexicographical (that is, two tuples are equal if and only if every element is the same and in the same position), and though I'm not very clear on what's going on inside Spark, I have tried comparing the duplicate keys in the result with the == operator, and hashing them

shuffle.FetchFailedException in spark on YARN job

2015-04-18 Thread roy
Hi, My spark job is failing with following error message org.apache.spark.shuffle.FetchFailedException: /mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index (No such file or directory) at

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
I don't think my experiment is surprising; it's my fault. To move away from my case, I wrote a test program which generates data randomly and casts the key to a string: import random import operator COUNT = 2 COUNT_PARTITIONS = 36 LEN = 233 rdd = sc.parallelize(((str(random.randint(1, LEN)), 1)

Can a map function return null

2015-04-18 Thread Steve Lewis
I find a number of cases where I have a JavaRDD and I wish to transform the data and, depending on a test, return zero or one items (don't suggest a filter - the real case is more complex). So I currently do something like the following - perform a flatMap returning a list with 0 or 1 entries depending

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
KafkaRDD uses the simple consumer API, and I think you need to handle offsets yourself, unless things have changed since I last looked. I would do the second approach. On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora shushantaror...@gmail.com wrote: Thanks!! I have a few more doubts: Does kafka RDD

newAPIHadoopRDD file name

2015-04-18 Thread Manas Kar
I would like to get the file name along with the associated objects so that I can do further mapping on it. My code below gives me AvroKey[myObject], NullWritable but I don't know how to get the file that gave those objects. sc.newAPIHadoopRDD(job.getConfiguration,
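
A hedged sketch of one way to recover the input file for each record: the RDD returned by the new-Hadoop-API methods is a NewHadoopRDD underneath, whose mapPartitionsWithInputSplit (a developer API in Spark 1.x) exposes the split, and for file-based formats the split can be cast to FileSplit. Shown with TextInputFormat and a hypothetical path so it is self-contained; the same approach should carry over to the Avro input format from the question, with sc predefined as in a spark-shell:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.NewHadoopRDD

val raw = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")

// Pair every record with the path of the file (split) it came from.
val withFileName = raw.asInstanceOf[NewHadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val path = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (path, line.toString) }
  }
```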

RE: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread SecondDatke
Well, I'm changing the strategy. (Assuming we have N(words) 10, interval between two time points 1 hour) * Use only the timestamp as the key, no duplicate key. But it would only reduce to something like the total number of words in a timespan. * Key = tuple(timestamp, word_id), duplication.

Re: Does reduceByKey only work properly for numeric keys?

2015-04-18 Thread Ted Yu
Looks like the two keys involving datetime.datetime(2009, 10, 6, 3, 0) in your first email might have been processed by an x86 node and an x64 node, respectively. Cheers On Sat, Apr 18, 2015 at 11:09 AM, SecondDatke lovejay-lovemu...@outlook.com wrote: I don't think my experiment is surprising,

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
I mean to say it is simpler in case of failures, restarts, upgrades, etc. Not just failures. But they did do a lot of work on streaming from Kafka in Spark 1.3.x to make it simpler (streaming simply calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe I am wrong and

Re: Can a map function return null

2015-04-18 Thread Olivier Girardot
You can return an RDD with null values inside, and afterwards filter on item != null. In Scala (or even in Java 8) you'd rather use Option/Optional, and in Scala they're directly usable from Spark. Example: sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item) else
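
A complete version of the pattern sketched in the reply above (the even-number test is only illustrative; substitute the real predicate), assuming a spark-shell where sc is predefined:

```scala
// flatMap over an Option emits zero or one output items per input, with no nulls in the RDD.
val kept = sc.parallelize(1 to 1000)
  .flatMap(item => if (item % 2 == 0) Some(item) else None)

println(kept.count())  // 500
```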

spark sql error with proto/parquet

2015-04-18 Thread Abhishek R. Singh
I have created a bunch of protobuf-based parquet files that I want to read/inspect using Spark SQL. However, I am running into exceptions and am not able to proceed much further: This succeeds (probably because there is no action yet). I can also printSchema() and count() without any

Dataframes Question

2015-04-18 Thread Arun Patel
Experts, I have a few basic questions on DataFrames vs Spark SQL. My confusion is more with DataFrames. 1) What is the difference between Spark SQL and DataFrames? Are they the same? 2) Documentation says SchemaRDD is renamed to DataFrame. Does this mean SchemaRDD no longer exists in 1.3? 3) As per

Re: Dataframes Question

2015-04-18 Thread Abhishek R. Singh
I am no expert myself, but from what I understand DataFrame is grandfathering SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as part of the 1.3.0 release. It is forward-looking and brings (data-frame-like) syntax that was not available with the older SchemaRDD. On
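
A minimal sketch of the two styles in Spark 1.3, assuming a spark-shell where sc is predefined and a hypothetical JSON file of people records; DataFrame operations and registered temp tables both go through the same SQLContext:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.jsonFile("hdfs:///data/people.json")  // returns a DataFrame in 1.3

// DataFrame-style syntax.
people.filter(people("age") > 21).select("name").show()

// Spark SQL over the same data, via a registered temporary table.
people.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 21").show()
```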

Spark Cassandra Connector

2015-04-18 Thread DStrip
Hello, I am facing some difficulties installing the Cassandra Spark connector. Specifically, I am working with Cassandra 2.0.13 and Spark 1.2.1. I am trying to build (create the JAR for) the connector, but unfortunately I cannot see in which file I have to declare the dependencies
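
For an sbt-based application, the dependencies typically go in build.sbt; a minimal sketch, assuming the 1.2.x line of the DataStax connector is the one matching Spark 1.2 / Cassandra 2.0 (the connector's compatibility table should be checked for the exact version):

```scala
// build.sbt (sketch)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                 % "1.2.1" % "provided",
  "com.datastax.spark" %% "spark-cassandra-connector"  % "1.2.0"
)
```

An assembly plugin (e.g. sbt-assembly) can then bundle the connector into the application JAR passed to spark-submit.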

Re: Spark Code to read RCFiles

2015-04-18 Thread Pramod Biligiri
Hi, I remember seeing a similar performance problem with Apache Shark last year when compared to Hive, though that was in a company-specific port of the code. Unfortunately I no longer have access to that code. The problem then was reflection-based class creation in the critical path of reading

spark application was submitted twice unexpectedly

2015-04-18 Thread Pengcheng Liu
Hi guys, I was trying to submit several Spark applications to a standalone cluster at the same time from a shell script. One issue I hit is that sometimes an application may be submitted to the cluster twice unexpectedly (from the web UI, I can see that the two applications with the same name were generated exactly at

Re: Can't get SparkListener to work

2015-04-18 Thread Archit Thakur
Hi Praveen, Can you try once removing the throw exception in map? Do you still not get it? On Apr 18, 2015 8:14 AM, Praveen Balaji secondorderpolynom...@gmail.com wrote: Thanks for the response, Imran. I probably chose the wrong methods for this email. I implemented all methods of SparkListener

Re: Actor not found

2015-04-18 Thread Zhihang Fan
Hi, Shixiong: Actually, I know nothing about this exception. I submitted a job that would read about 2.5T of data and it threw this exception. I also tried to submit some jobs that had run successfully before this submission; they also failed with the same exception. Hope this will help you to

Re: Custom partioner

2015-04-18 Thread Archit Thakur
Yes, you can. Use the partitionBy method and pass a partitioner to it. On Apr 17, 2015 8:18 PM, Jeetendra Gangele gangele...@gmail.com wrote: OK, is there a way I can use hash partitioning so that I can improve the performance? On 17 April 2015 at 19:33, Archit Thakur archit279tha...@gmail.com
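
A minimal sketch of explicit hash partitioning on a pair RDD (the data and partition count are illustrative), assuming a spark-shell where sc is predefined; caching the partitioned RDD lets later key-based operations reuse the partitioning instead of reshuffling:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Hash-partition by key into 8 partitions and keep the result around for reuse.
val partitioned = pairs.partitionBy(new HashPartitioner(8)).cache()

println(partitioned.partitioner)  // Some(org.apache.spark.HashPartitioner@...)
```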

MLlib -Collaborative Filtering

2015-04-18 Thread riginos
Is there any way that I can see the similarity table of 2 users in that algorithm? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLlib-Collaborative-Filtering-tp22552.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: When querying ElasticSearch, score is 0

2015-04-18 Thread Nick Pentreath
ES-hadoop uses a scan & scroll search to efficiently retrieve large result sets. Scores are not tracked in a scan, and sorting is not supported, hence the 0 scores. http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan — Sent from Mailbox

Re: MLlib -Collaborative Filtering

2015-04-18 Thread Nick Pentreath
What do you mean by similarity table of 2 users? Do you mean the similarity between 2 users? — Sent from Mailbox On Sat, Apr 18, 2015 at 11:09 AM, riginos samarasrigi...@gmail.com wrote: Is there any way that I can see the similarity table of 2 users in that algorithm? -- View this

Re: When querying ElasticSearch, score is 0

2015-04-18 Thread Andrejs Abele
Thank you for the information. Cheers, Andrejs On 04/18/2015 10:23 AM, Nick Pentreath wrote: ES-hadoop uses a scan & scroll search to efficiently retrieve large result sets. Scores are not tracked in a scan, and sorting is not supported, hence the 0 scores.