Can you show us the function you passed to reduceByKey()?
What release of Spark are you using?
Cheers
On Sat, Apr 18, 2015 at 8:17 AM, SecondDatke lovejay-lovemu...@outlook.com
wrote:
I'm trying to solve a Word-Count-like problem; the difference is that I need
the count of a specific
Hi
I want to consume messages from a Kafka queue using a Spark batch program, not
Spark Streaming. Is there any way to achieve this, other than using the
low-level (simple API) Kafka consumer?
Thanks
Is there any way that I can see the similarity table of 2 users in that
algorithm? By that I mean the similarity between 2 users.
Use KafkaRDD directly. It is in the spark-streaming-kafka package.
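For reference, a minimal sketch (not from this thread) of reading a fixed offset range as a batch RDD with KafkaUtils.createRDD from the spark-streaming-kafka artifact (Spark 1.3+); the broker list, topic name, and offsets are placeholders, and sc is an existing SparkContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Placeholder broker list; you must already know which offsets you want to read.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val offsetRanges = Array(OffsetRange("my-topic", 0, fromOffset = 0L, untilOffset = 1000L))

    // Returns an RDD[(String, String)] of (key, message) pairs covering exactly those offsets.
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)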
On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Hi
I want to consume messages from a Kafka queue using a Spark batch program, not
Spark Streaming. Is there any way to achieve this, other than using the low
Write the Kafka stream to HDFS via Spark Streaming, then ingest the files from
HDFS via Spark.
-Original Message-
From: Shushant Arora
[shushantaror...@gmail.com]
Sent: Saturday, April 18, 2015 06:44 AM Eastern Standard Time
To:
I'm trying to solve a Word-Count-like problem; the difference is that I need the
count of a specific word within a specific timespan in a social message
stream.
My data is in the format of (time, message), and I transformed it (flatMap etc.)
into a series of (time, word_id); the time is
Hi,
I am wondering what mechanism determines the number of partitions
created by SparkContext.sequenceFile? For example, although my file has only 4
splits, Spark creates 16 partitions for it. Is it determined by the file size?
Is there any way to control it? (Looks like I can only
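For what it's worth, a minimal sketch (an assumption on my part, not an answer from this thread): sc.sequenceFile accepts a minPartitions argument that acts as a lower bound on the partition count; the path and key/value classes below are hypothetical:

    import org.apache.hadoop.io.{LongWritable, Text}

    // Hypothetical path and key/value types; minPartitions is only a hint (a lower bound),
    // the final count also depends on the underlying input splits.
    val rdd = sc.sequenceFile("/data/events.seq",
      classOf[LongWritable], classOf[Text], minPartitions = 4)
    println(rdd.partitions.length)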
Do these datetime objects implement the notion of equality you'd
expect? (This may be a dumb question; I'm thinking of the equivalent
of equals() / hashCode() from the Java world.)
On Sat, Apr 18, 2015 at 4:17 PM, SecondDatke
lovejay-lovemu...@outlook.com wrote:
I'm trying to solve a
Thanks for the response, Archit. I get callbacks when I do not throw an
exception from map.
My use case, however, is to get callbacks for exceptions in transformations
on executors. Do you think I'm going down the right route?
Cheers
-p
On Sat, Apr 18, 2015 at 1:49 AM, Archit Thakur
The comparison of Python tuples is lexicographical (that is, two tuples are equal if
and only if every element is the same and in the same position), and though I'm not
very clear on what's going on inside Spark, I have tried comparing the
duplicate keys in the result with the == operator, and hashing them
Hi,
My Spark job is failing with the following error message:
org.apache.spark.shuffle.FetchFailedException:
/mnt/ephemeral12/yarn/nm/usercache/abc/appcache/application_1429353954024_1691/spark-local-20150418132335-0723/28/shuffle_3_1_0.index
(No such file or directory)
at
I don't think my experiment is surprising, it's my fault:
To move away from my case, I wrote a test program, which generates data
randomly, and casts the keys to strings:
import random
import operator
COUNT = 2
COUNT_PARTITIONS = 36
LEN = 233
rdd = sc.parallelize(((str(random.randint(1, LEN)), 1)
I find a number of cases where I have a JavaRDD and I wish to transform
the data and, depending on a test, return 0 or 1 items (don't suggest a
filter - the real case is more complex). So I currently do something like
the following - perform a flatMap returning a list with 0 or 1 entries,
depending
KafkaRDD uses the simple consumer API, and I think you need to handle
offsets yourself, unless things have changed since I last looked.
I would do the second approach.
On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Thanks!!
I have a few more doubts:
Does Kafka RDD
I would like to get the file name along with the associated objects so that
I can do further mapping on them.
My code below gives me (AvroKey[myObject], NullWritable) pairs, but I don't know how
to get the file that produced those objects.
sc.newAPIHadoopRDD(job.getConfiguration,
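One possible approach (not from this thread, so treat it as an untested sketch): the RDD returned by sc.newAPIHadoopRDD is a NewHadoopRDD, whose mapPartitionsWithInputSplit exposes the InputSplit, and for file-based formats that split can usually be cast to a FileSplit to recover the path. MyObject and rdd below stand in for the question's own types:

    import org.apache.avro.mapred.AvroKey
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.InputSplit
    import org.apache.hadoop.mapreduce.lib.input.FileSplit
    import org.apache.spark.rdd.NewHadoopRDD

    // Assuming `rdd` came from sc.newAPIHadoopRDD(...) with Avro input; pair each
    // record with the path of the file its partition was read from.
    val withFileNames = rdd.asInstanceOf[NewHadoopRDD[AvroKey[MyObject], NullWritable]]
      .mapPartitionsWithInputSplit((split: InputSplit, iter: Iterator[(AvroKey[MyObject], NullWritable)]) => {
        val file = split.asInstanceOf[FileSplit].getPath.toString
        iter.map { case (k, _) => (file, k.datum()) }
      })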
Well, I'm changing the strategy.
(Assuming we have N(words) 10, and the interval between two time points is 1 hour)
* Use only the timestamp as the key, no duplicate keys. But that would only reduce
to something like the total number of words in a timespan.
* Key = tuple(timestamp, word_id), with duplication.
Looks like the two keys involving datetime.datetime(2009, 10, 6, 3, 0) in
your first email might have been processed by an x86 node and an x64 node,
respectively.
Cheers
On Sat, Apr 18, 2015 at 11:09 AM, SecondDatke lovejay-lovemu...@outlook.com
wrote:
I don't think my experiment is surprising,
I mean to say it is simpler in case of failures, restarts, upgrades, etc.
Not just failures.
But they did do a lot of work on streaming from Kafka in Spark 1.3.x to
make it simpler (streaming simply calls KafkaRDD for every batch if you use
KafkaUtils.createDirectStream), so maybe I am wrong and
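For reference, a minimal sketch (not from this thread) of the Spark 1.3 direct approach mentioned above; the broker list and topic name are placeholders, and sc is an existing SparkContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(sc, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // Each batch of this DStream is backed by a KafkaRDD covering the newly arrived offsets.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))

    stream.map(_._2).print()
    ssc.start()
    ssc.awaitTermination()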
You can return an RDD with null values inside, and afterwards filter on
item != null.
In Scala (or even in Java 8) you'd rather use Option/Optional, and in Scala
they're directly usable from Spark.
Example:
sc.parallelize(1 to 1000).flatMap(item => if (item % 2 == 0) Some(item)
else None)
I have created a bunch of protobuf-based parquet files that I want to
read/inspect using Spark SQL. However, I am running into exceptions and am not
able to proceed much further:
This succeeds (probably because there is no action yet). I can
also printSchema() and count() without any
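For context, a minimal sketch of the basic read path (the file path is a placeholder; in Spark 1.2/1.3 sqlContext.parquetFile is the entry point, and because the read is lazy, schema problems often only surface when an action runs):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Hypothetical path; nothing is actually read until an action is triggered.
    val df = sqlContext.parquetFile("/data/protobuf-output.parquet")
    df.printSchema()
    println(df.count())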
Experts,
I have a few basic questions on DataFrames vs Spark SQL. My confusion is
more with DataFrames.
1) What is the difference between Spark SQL and DataFrames? Are they the same?
2) The documentation says SchemaRDD has been renamed to DataFrame. Does this mean
SchemaRDD does not exist in 1.3?
3) As per
I am no expert myself, but from what I understand DataFrame supersedes
SchemaRDD. This was done for API stability as Spark SQL matured out of alpha as
part of the 1.3.0 release.
It is forward-looking and brings (dataframe-like) syntax that was not available
with the older SchemaRDD.
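A small illustration of the 1.3 API (the JSON path is a placeholder): the operations that used to live on SchemaRDD are now on DataFrame, with the relational-style methods and plain SQL both available:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // jsonFile returned a SchemaRDD before 1.3; in 1.3 it returns a DataFrame.
    val df = sqlContext.jsonFile("/data/people.json")

    df.printSchema()
    df.select("name").show()          // DataFrame-style syntax
    df.registerTempTable("people")    // still queryable via Spark SQL
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()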
On
Hello,
I am facing some difficulties installing the Cassandra Spark connector.
Specifically, I am working with Cassandra 2.0.13 and Spark 1.2.1. I am trying
to build the connector - create the JAR - but unfortunately I cannot
see in which file I have to declare the dependencies
Hi,
I remember seeing a similar performance problem with Apache Shark last year
when compared to Hive, though that was in a company-specific port of the
code. Unfortunately I no longer have access to that code. The problem then
was reflection-based class creation in the critical path of reading
Hi guys,
I was trying to submit several Spark applications to a standalone cluster at
the same time from a shell script. One issue I hit is that sometimes an
application may be submitted to the cluster twice unexpectedly (from the web UI, I
can see that two applications with the same name were generated exactly at
Hi Praveen,
Can you try removing the throw exception in map once? Do you still not get it?
On Apr 18, 2015 8:14 AM, Praveen Balaji secondorderpolynom...@gmail.com
wrote:
Thanks for the response, Imran. I probably chose the wrong methods for
this email. I implemented all methods of SparkListener
Hi Shixiong,
Actually, I know nothing about this exception. I submitted a job that
would read about 2.5T of data and it threw this exception. I also tried to
submit some jobs that had run successfully before this submission, and they also
failed with the same exception.
Hope this will help you to
Yes, you can. Use the partitionBy method and pass a partitioner to it.
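For reference, a minimal sketch (the RDD contents are made up): partitionBy is available on pair RDDs and takes any Partitioner, e.g. a HashPartitioner; giving both sides of a join the same partitioner avoids a shuffle:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._  // pair RDD functions on older Spark versions

    // Hypothetical pair RDDs, both hash-partitioned into 8 partitions.
    val left = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(8))
    val right = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(new HashPartitioner(8))

    // Co-partitioned inputs let the join avoid re-shuffling the data.
    val joined = left.join(right)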
On Apr 17, 2015 8:18 PM, Jeetendra Gangele gangele...@gmail.com wrote:
OK, is there a way I can use hash partitioning so that I can improve the
performance?
On 17 April 2015 at 19:33, Archit Thakur archit279tha...@gmail.com
Is there any way that I can see the similarity table of 2 users in that
algorithm?
ES-Hadoop uses a scan & scroll search to efficiently retrieve large result
sets. Scores are not tracked in a scan and sorting is not supported, hence the 0
scores.
http://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
What do you mean by similarity table of 2 users?
Do you mean the similarity between 2 users?
On Sat, Apr 18, 2015 at 11:09 AM, riginos samarasrigi...@gmail.com
wrote:
Is there any way that I can see the similarity table of 2 users in that
algorithm?
Thank you for the information.
Cheers,
Andrejs
On 04/18/2015 10:23 AM, Nick Pentreath wrote:
ES-Hadoop uses a scan & scroll search to efficiently retrieve large
result sets. Scores are not tracked in a scan and sorting is not
supported, hence the 0 scores.