import itertools
l = [1,1,1,2,2,3,4,4,5,1]
gs = itertools.groupby(l)
map(lambda (n, it): (n, sum(1 for _ in it)), gs)
[(1, 3), (2, 2), (3, 1), (4, 2), (5, 1), (1, 1)]
def groupCount(l):
    gs = itertools.groupby(l)
    return map(lambda (n, it): (n, sum(1 for _ in it)), gs)
If you have an RDD, you can use RDD.mapPartitions(groupCount).collect()
What happens when a run of numbers is spread across a partition boundary?
I think you might end up with two adjacent groups of the same value in
that situation.
On Mon, Aug 18, 2014 at 2:05 AM, Davies Liu dav...@databricks.com wrote:
import itertools
l = [1,1,1,2,2,3,4,4,5,1]
gs =
On Sun, Aug 17, 2014 at 11:07 PM, Andrew Ash and...@andrewash.com wrote:
What happens when a run of numbers is spread across a partition boundary? I
think you might end up with two adjacent groups of the same value in that
situation.
Yes, you need another scan to combine these contiguous groups.
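For reference, a minimal sketch of that second pass, collapsing adjacent runs of the same
value after the per-partition results are collected (the thread's examples are Python;
this Scala version and its names are only illustrative):

// Collapse adjacent (value, count) pairs with the same value, e.g. groups that were
// split at a partition boundary: (1,3),(2,2),(2,1),(3,1) -> (1,3),(2,3),(3,1)
def mergeAdjacent(groups: Seq[(Int, Int)]): List[(Int, Int)] =
  groups.foldLeft(List.empty[(Int, Int)]) {
    case ((v, c) :: rest, (v2, c2)) if v == v2 => (v, c + c2) :: rest // same value: add counts
    case (acc, g)                              => g :: acc            // new value: start a group
  }.reverse

// mergeAdjacent(Seq((1, 3), (2, 2), (2, 1), (3, 1))) == List((1, 3), (2, 3), (3, 1))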
Another option is using Tachyon to cache the RDD, then the cache can
be shared by different applications. See how to use Spark with
Tachyon: http://tachyon-project.org/Running-Spark-on-Tachyon.html
Davies
On Sun, Aug 17, 2014 at 4:48 PM, ryaminal tacmot...@gmail.com wrote:
You can also look
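A rough sketch of the Tachyon sharing idea, assuming the Tachyon client is configured as
in the linked guide; the host, port, and path are illustrative. One application writes
the RDD out, and a different application reads it back:

// in the producing application, given some existing RDD `rdd`
rdd.saveAsTextFile("tachyon://master:19998/shared/my-rdd")

// in the consuming application, with its own SparkContext `sc`
val shared = sc.textFile("tachyon://master:19998/shared/my-rdd")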
Hi Ghousia,
You can try the following:
1. Increase the heap size
https://spark.apache.org/docs/0.9.0/configuration.html
2. Increase the number of partitions
http://stackoverflow.com/questions/21698443/spark-best-practice-for-retrieving-big-data-from-rdd-to-local-machine
3. You could try
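A small sketch of where suggestions 1 and 2 go; the memory size and partition count are
only examples, not recommendations, and `rdd` stands for your existing RDD:

val conf = new org.apache.spark.SparkConf()
  .set("spark.executor.memory", "4g")    // 1. larger heap per executor (example value)
val moreParts = rdd.repartition(400)     // 2. more partitions (example count)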
Hello:
I am trying to setup Spark to connect to a Hive table which is backed by
HBase, but I am running into the following NullPointerException:
scala> val hiveCount = hiveContext.sql("select count(*) from dataset_records").collect().head.getLong(0)
14/08/18 06:34:29 INFO ParseDriver: Parsing
Hi All,
I'm new to Spark and Scala; I only recently started using the language and love it, but
there is a small coding problem when I want to convert my existing MapReduce code from
Java to Spark...
In Java, I create a class by extending org.apache.hadoop.mapreduce.Mapper and
override the setup(),
Thanks for the answer, Akhil. Right now we are working around this issue by
increasing the number of partitions, and we are persisting RDDs to
DISK_ONLY. But the issue is with heavy computations within an RDD. It would
be better if we had the option of spilling the intermediate transformation
Looks like your hiveContext is null. Have a look at this documentation.
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
Thanks
Best Regards
On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo ce...@zephyrhealthinc.com
wrote:
Hello:
I am trying to setup Spark to
Hi All,
Please ignore my question, I found a way to implement it via old archive mails:
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3CCAF_KkPzpU4qZWzDWUpS5r9bbh=-hwnze2qqg56e25p--1wv...@mail.gmail.com%3E
Best regards,
Henry
From: MA33 YTHung1
Sent: Monday, August 18, 2014
I believe spark.shuffle.memoryFraction is the one you are looking for.
spark.shuffle.memoryFraction : Fraction of Java heap to use for aggregation
and cogroups during shuffles, if spark.shuffle.spill is true. At any given
time, the collective size of all in-memory maps used for shuffles is
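A sketch of setting it; 0.4 is an arbitrary example, and the default is 0.2:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")            // must be true for the fraction to apply
  .set("spark.shuffle.memoryFraction", "0.4")    // fraction of the heap for shuffle maps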
You can create an RDD and then you can do a map or mapPartitions on that
where in the top you will create the database connection and all, then do
the operations and at the end close the connections.
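A minimal sketch of that pattern with foreachPartition, assuming some `rdd` and a JDBC
database; the URL and the per-row write are placeholders:

import java.sql.DriverManager

rdd.foreachPartition { rows =>
  val conn = DriverManager.getConnection("jdbc:postgresql://host/db")  // open once per partition
  try {
    rows.foreach { row =>
      // do the per-row operations with conn, e.g. execute an INSERT
    }
  } finally {
    conn.close()                                                       // close at the end
  }
}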
Thanks
Best Regards
On Mon, Aug 18, 2014 at 12:34 PM, Henry Hung ythu...@winbond.com wrote:
I think this was a more comprehensive answer recently. Tobias is right
that it is not quite that simple:
http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/%3CCAPH-c_O9kQO6yJ4khXUVdO=+D4vj=JfG2tP9eqn5RPko=dr...@mail.gmail.com%3E
On Mon, Aug 18, 2014 at 8:04 AM, Henry Hung
Nope, it is NOT null. Check this out:
scala> hiveContext == null
res2: Boolean = false
And thanks for sending that link, but I had already looked at it. Any other
ideas?
I looked through some of the relevant Spark Hive code and I'm starting to
think this may be a bug.
-Cesar
On Mon, Aug 18,
I slightly modified the code to use while(partitions.hasNext) { } instead of
partitions.map(func).
I suppose this eliminates the uncertainty from lazy execution.
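For readers following along, a sketch of that change, where func stands in for whatever
was previously passed to map: Iterator.map is lazy, so partitions.map(func) inside
foreachPartition builds an iterator that is never consumed, while an explicit loop forces
the work.

rdd.foreachPartition { partitions =>
  // partitions.map(func) would build a lazy iterator and discard it; loop instead
  while (partitions.hasNext) {
    func(partitions.next())
  }
}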
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Monday, August 18, 2014 3:10 PM
To: MA33 YTHung1
Cc:
Thanks for the reply!
def groupCount(l):
    gs = itertools.groupby(l)
    return map(lambda (n, it): (n, sum(1 for _ in it)), gs)
If you have an RDD, you can use RDD.mapPartitions(groupCount).collect()
Yes, I am interested in RDD - not pure Python :)
I am new to Spark, can you explain:
-
Then it's definitely a jar conflict. Can you try removing this jar from the
class path:
/opt/spark-poc/lib_managed/jars/org.spark-project.hive/hive-exec/hive-exec-0.12.0.jar
Thanks
Best Regards
On Mon, Aug 18, 2014 at 12:45 PM, Cesar Arevalo ce...@zephyrhealthinc.com
wrote:
Nope, it is NOT
Thanks, I got it !
Maybe irrelevant, but this closely resembles the S3 Parquet file issue we've
run into before. It takes a dozen minutes to read the metadata because the
ParquetInputFormat tries to call getFileStatus for all part-files
sequentially.
Just checked SequenceFileInputFormat, and found that a MapFile may share
That helps a lot.
Thanks.
Zhanfeng Huo
From: Davies Liu
Date: 2014-08-18 14:31
To: ryaminal
CC: u...@spark.incubator.apache.org
Subject: Re: application as a service
Another option is using Tachyon to cache the RDD, then the cache can
be shared by different applications. See how to use
Hi All.
I need to create a lot of RDDs starting from a set of roots and count the
rows in each. Something like this:
final JavaSparkContext sc = new JavaSparkContext(conf);
List<String> roots = ...
Map<String, Object> res = sc.parallelize(roots).mapToPair(new
PairFunction<String, String, Long>(){
You won't be able to use RDDs inside of an RDD operation. I imagine your
immediate problem is that the code you've elided references 'sc' and
that gets referenced by the PairFunction and serialized, but it can't
be.
If you want to play it this way, parallelize across roots in Java.
That is just use
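A Scala sketch of that suggestion (the thread's code is Java; the root names and paths
are illustrative, and sc is the driver's SparkContext): loop over the roots in the driver
program, creating and counting one RDD per root, instead of nesting RDD operations.

val roots = Seq("rootA", "rootB", "rootC")
val counts: Map[String, Long] =
  roots.map(root => root -> sc.textFile("hdfs:///data/" + root).count()).toMap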
Hi all,
In an RDD map, I invoke an object that is *serialized* by standard Java
serialization, and I get this exception:
com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 13
at com.esotericsoftware.kryo.io.Output.require(Output.java:138)
at
Based on my understanding, something like this doesn't seem to be possible out
of the box, but I thought I would write it up anyway in case someone has any
ideas.
We have conceptually one high-volume input stream; each streaming job is
interested in either a subset of the stream or the entire
Hello Mayur,
The 3 in new RangePartitioner(*3*, partitionedFile); is also a hard-coded
value for the number of partitions. Did you find a way to avoid
that? Besides the cluster size, the number of partitions also depends on the
input data size. Correct me if I am wrong.
Hi, I was able to set this parameter in my application to resolve this issue:
set("spark.kryoserializer.buffer.mb", "256")
Please let me know if this helps.
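For context, a sketch of where that setting goes when building the SparkConf; 256 is the
value suggested above, and enabling Kryo is shown for completeness:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "256")   // raise the Kryo output buffer limit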
Date: Mon, 18 Aug 2014 21:50:02 +0800
From: dujinh...@hzduozhun.com
To: user@spark.apache.org
Subject: spark kryo serilizable exception
I removed the JAR that you suggested but now I get another error when I try
to create the HiveContext. Here is the error:
scala> val hiveContext = new HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to
term ql
in package org.apache.hadoop.hive which is not
I'm curious to see whether, if you declare the broadcast wrapper as a var and
overwrite it in the driver program, the modification has a stable impact
on all transformations/actions defined BEFORE the overwrite but executed
lazily AFTER the overwrite:
val a = sc.parallelize(1 to 10)
var
rdd.flatMap(lambda x: x) may solve your problem; it will
convert an RDD from
[[[1,2,3],[4,5,6]],[[7,8,9,],[10,11,12]]]
into:
[[1,2,3], [4,5,6], [7,8,9,], [10,11,12]]
On Mon, Aug 18, 2014 at 2:42 AM, Chengi Liu chengi.liu...@gmail.com wrote:
I have an rdd in pyspark which looks like
I followed your instructions to try to load data in Parquet format through
hiveContext but failed. Do you happen to see what I am doing incorrectly in the
following steps?
The steps I am following are:
1. download parquet-hive-bundle-1.5.0.jar
2. revise hive-site.xml including this:
property
Hey all,
I'm trying to run connected components in GraphX on about 400GB of data on 50
m3.xlarge nodes on EMR. I keep getting java.nio.channels.CancelledKeyException
when it gets to "mapPartitions at VertexRDD.scala:347". I haven't been able to
find much about this online, and nothing that
Hi,
I have a piece of code in which the result of a groupByKey operation is as
follows:
(2013-04, ArrayBuffer(s1, s2, s3, s1, s2, s4))
The first element is a String value representing a date and the ArrayBuffer
consists of (non-unique) strings. I want to extract the unique elements of
the
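One possible sketch, assuming the data is an RDD[(String, String)] named pairs,
e.g. ("2013-04", "s1"), ("2013-04", "s2"), ...:

val grouped = pairs.groupByKey()            // (String, Iterable[String]) with duplicates
val uniques = grouped.mapValues(_.toSet)    // keep only the unique strings per key
// or, if only the number of distinct values per key is needed:
val uniqueCounts = pairs.distinct().countByKey()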
The Spark job that has the main DStream could have another DStream that is
listening for stream subscription requests. So when a subscription is
received, you could do a filter/forEach on the main DStream and respond to
that one request. So you're basically creating a stream server that is
Well, it looks like I can use the .repartition(1) method to stuff everything
in one partition so that gets rid of the duplicate messages I send to
RabbitMQ but that seems like a bad idea perhaps. Wouldn't that hurt
scalability?
Hi,
My data source stores the incoming data every 10 seconds to HDFS. The
naming convention is save-timestamp.csv (see below):
drwxr-xr-x ali supergroup 0 B 0 0 B save-1408396065000.csv
drwxr-xr-x ali supergroup 0 B 0 0 B save-140839607.csv
drwxr-xr-x
Looks like hbaseTableName is null, probably caused by incorrect configuration.
String hbaseTableName = jobConf.get(HBaseSerDe.HBASE_TABLE_NAME);
setHTable(new HTable(HBaseConfiguration.create(jobConf),
Bytes.toBytes(hbaseTableName)));
Here is the definition.
public static final
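A sketch of setting that property by hand, assuming HBaseSerDe.HBASE_TABLE_NAME resolves
to "hbase.table.name" and that dataset_records is the backing HBase table (both are
assumptions):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.mapred.JobConf

val jobConf = new JobConf(HBaseConfiguration.create())
jobConf.set("hbase.table.name", "dataset_records")   // assumed key and table name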
What's the correct way to use setCallSite to get the change to show up in
the spark logs?
I have something like
class RichRDD (rdd : RDD[MyThing]) {
def mySpecialOperation() {
rdd.context.setCallSite("bubbles and candy!")
rdd.map()
val result = rdd.groupBy()
On Mon, Aug 18, 2014 at 6:29 AM, Yifan LI iamyifa...@gmail.com wrote:
I am testing our application(similar to personalised page rank using
Pregel, and note that each vertex property will need pretty much more space
to store after new iteration)
[...]
But when we ran it on larger graph(e.g.
I think the behavior is by design. If b is not persisted, then on each
call to b.collect, broadcasted points to a new broadcast variable, serialized
by the driver and fetched by the executors.
If you do persist, you don't expect the RDD to change due to a new broadcast
variable.
Thanks.
I want to send an HTTP request (specifically to OpenTSDB) to get data. I've
been looking at the StreamingContext api and don't seem to see any methods
that can connect to this. Has anyone tried connecting Spark Streaming to a
server via HTTP requests before? How have you done it?
Thanks, Zhan, for the follow-up.
But do you know how I am supposed to set that table name on the jobConf? I
don't have access to that object from my client driver.
Hello,
I have an HA-enabled YARN cluster with two resource managers. When submitting
jobs via "spark-submit --master yarn-cluster", it appears that the driver is
looking explicitly for the "yarn.resourcemanager.address" property rather than
round-robining through the resource managers via the
Hi John,
It seems like the original problem you had was that you were initializing the
RabbitMQ connection on the driver, but then calling the code to write to
RabbitMQ on the workers (I'm guessing, but I don't know since I didn't see
your code). That's definitely a problem because the connection
Oh sorry, just to be more clear - writing from the driver program is only
safe if the amount of data you are trying to write is small enough to fit
in memory on the driver program. I looked at your code, and since you are
just writing a few things each time interval, this seems safe.
-Vida
On
Hi,
I have a piece of code that reads all the (csv) files in a folder. For each
file, it parses each line, extracts the first 2 elements from each row of
the file, groups the tuple by the key and finally outputs the number of
unique values for each key.
val conf = new
We are prototyping an application with Spark Streaming and Kinesis. We use
Kinesis to accept incoming transaction data, and then process it using Spark
Streaming. So far we have really liked both technologies, and we have seen both
technologies maturing rapidly. We are almost settled on using these two
You need to create a custom receiver that submits the HTTP requests then
deserializes the data and pushes it into the Streaming context.
See here for an example:
http://spark.apache.org/docs/latest/streaming-custom-receivers.html
On 8/18/14, 6:20 PM, bumble123 tc1...@att.com wrote:
I want to
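For illustration, a rough sketch of such a receiver; the URL, the polling interval, and
treating the response as plain text are assumptions, not an OpenTSDB client:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.io.Source

class HttpPollingReceiver(url: String, intervalMs: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("http-polling-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val body = Source.fromURL(url).mkString   // issue the HTTP request
          store(body)                               // push the response into Spark Streaming
          Thread.sleep(intervalMs)
        }
      }
    }.start()
  }

  def onStop(): Unit = {}   // the polling loop exits once isStopped() becomes true
}

// usage sketch: ssc.receiverStream(new HttpPollingReceiver("http://opentsdb-host:4242/api/query", 5000))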
First the JAR needs to be deployed using the --jars argument. Then in your
HQL code you need to use the DeprecatedParquetInputFormat and
DeprecatedParquetOutputFormat as described here:
https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12
This is because SparkSQL is based
fil wrote
- Python functions like groupCount; these get reflected from their Python
AST and converted into a Spark DAG? Presumably if I try and do something
non-convertible this transformation process will throw an error? In other
words this runs in the JVM.
Further to this - it seems that
Hi Wei,
On Tue, Aug 19, 2014 at 10:18 AM, Wei Liu wei@stellarloyalty.com
wrote:
Since our application cannot tolerate losing customer data, I am wondering
what is the best way for us to address this issue.
1) We are thinking writing application specific logic to address the data
loss. To
I'm using CDH 5.1 with spark 1.0.
When I try to run Spark SQL following the Programming Guide
val parquetFile = sqlContext.parquetFile(path)
If the path is a file, it throws an exception:
Exception in thread "main" java.lang.IllegalArgumentException:
Expected hdfs://*/file.parquet for be a
On Mon, Aug 18, 2014 at 7:41 PM, fil f...@pobox.com wrote:
fil wrote
- Python functions like groupCount; these get reflected from their Python
AST and converted into a Spark DAG? Presumably if I try and do something
non-convertible this transformation process will throw an error? In other
Yeah, thanks a lot. I know that for people who understand lazy execution this seems
straightforward, but for those who don't it may become a liability.
I've only tested its stability on a small example (which seems stable);
hopefully it's not just serendipity. Can a committer confirm this?
Yours, Peng
I think Spark Streaming currently lacks a data acknowledgment mechanism for when data
is stored and replicated in the BlockManager, so data can potentially be lost even after
being pulled from Kafka, say if the data is stored just in the BlockGenerator and not the
BlockManager while Kafka itself has already committed the consumer offset,
Dear All,
Recently I have written a Spark Kafka consumer to solve this problem. We have
also seen issues with KafkaUtils, which uses the high-level Kafka consumer, and
the consumer code has no handle on offset management.
The code below solves this problem, and it is being tested in our
Spark
OK I tried your build -
First you need to put sbt in C:\sbt
Then you get
Microsoft Windows [Version 6.2.9200]
(c) 2012 Microsoft Corporation. All rights reserved.
e:\which java
/cygdrive/c/Program Files/Java/jdk1.6.0_25/bin/java
e:\java -version
java version 1.6.0_25
Java(TM) SE Runtime
Thank you all for responding to my question. I am pleasantly surprised by
how many prompt responses I got. It shows the strength of the Spark
community.
Kafka is still an option for us, I will check out the link provided by
Dibyendu.
Meanwhile if someone out there already figured this out with
Hmm, I thought as much. I am using Cassandra with the Spark connector. What
I really need is an RDD created from a query against Cassandra of the form
"where partition_key = :id", where :id is taken from a list. Some grouping
of the ids would be a way to partition this.
On Mon, Aug 18, 2014 at 3:42