Hello,
To convert existing Map Reduce jobs to Spark, I need to implement window
functions such as FIRST_VALUE, LEAD, LAG and so on. For example,
FIRST_VALUE function:
Source (1st column is key):
A, A1
A, A2
A, A3
B, B1
B, B2
C, C1
and the result should be
A, A1, A1
A, A2, A1
A, A3, A1
B, B1,
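Not a Spark-specific answer, but the FIRST_VALUE logic described above can be sketched in a few lines of plain Python (names are illustrative only); in Spark the same per-key pass could run inside a groupByKey:

```python
# FIRST_VALUE sketch: tag every row with the first value seen for its key.
rows = [("A", "A1"), ("A", "A2"), ("A", "A3"),
        ("B", "B1"), ("B", "B2"), ("C", "C1")]

first = {}    # key -> first value encountered for that key
result = []
for key, value in rows:
    first.setdefault(key, value)      # only the first value per key sticks
    result.append((key, value, first[key]))

assert result[:3] == [("A", "A1", "A1"), ("A", "A2", "A1"), ("A", "A3", "A1")]
```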
Hi all,
I am trying to come up with a workflow where I can query streams
asynchronously. The problem I have is that ssc.awaitTermination() blocks
the whole thread, so it is not straightforward to me whether it is possible
to get hold of objects from streams once they are started. Any suggestions?
You call awaitTermination() in the main thread, and indeed it blocks
there forever. From there Spark Streaming takes over, and is invoking
the operations you set up. Your operations have access to the data of
course. That's the model; you don't make external threads that reach
into Spark.
Frankly, there is no good/standard way to visualize streaming data. So far I
have found HBase to be a good intermediate store for data from streams, and
visualize it with a Play-based framework and d3.js.
Regards
Mayur
On Fri Feb 13 2015 at 4:22:58 PM Kevin (Sangwoo) Kim kevin...@apache.org
wrote:
I'm not very
Hi,
is SparkSQL + Parquet suitable to replicate a star schema?
Paolo Platter
AgileLab CTO
A number of comments:
310GB is probably too large for an executor. You probably want many
smaller executors per machine. But this is not your problem.
You didn't say where the OutOfMemoryError occurred. Executor or driver?
Tuple2 is a Scala type, and a general type. It is appropriate for
Hello,
I have the following scenario and was wondering if I can use Spark to
address it.
I want to query two different data stores (say, ElasticSearch and MySQL)
and then merge the two result sets based on a join key between the two. Is
it appropriate to use Spark to do this join, if the
It is a wrapper whose API is logically the same, but whose method
signatures make more sense in Java. You can call the Scala API in Java
without too much trouble, but it gets messy when you have to manually
grapple with ClassTag from Java for example.
There is not an implicit conversion since it
I'm not very sure about CDH 5.3,
but Zeppelin now works with Spark 1.2, as spark-repl has been published in
Spark 1.2.1
Please try again!
On Fri Feb 13 2015 at 3:55:19 PM Su She suhsheka...@gmail.com wrote:
Thanks Kevin for the link, I have had issues trying to install zeppelin as
I believe it is
Hello,
In Spark programming guide
(http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a
recommendation:
Typically you want 2-4 partitions for each CPU in your cluster.
We have a Spark Master and two Spark workers each with 18 cores and 18 GB of
RAM.
In our application we use
18 cores or 36? It probably doesn't matter.
For this case where you have some overhead per partition of setting up
the DB connection, it may indeed not help to chop up the data more
finely than your total parallelism. Although that would imply quite an
overhead. Are you doing any other expensive
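To make the guide's 2-4-per-CPU rule concrete, a quick sanity check (cluster numbers taken from the message above):

```python
# 2-4 partitions per CPU core, per the Spark programming guide
workers = 2
cores_per_worker = 18
total_cores = workers * cores_per_worker   # 36

low = 2 * total_cores    # 72 partitions at the low end
high = 4 * total_cores   # 144 partitions at the high end
assert (low, high) == (72, 144)
```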
Get it. Thanks Reynold and Andrew!
Jianshi
On Thu, Feb 12, 2015 at 12:25 AM, Andrew Or and...@databricks.com wrote:
Hi Jianshi,
For YARN, there may be an issue with how a recently patch changes the
accessibility of the shuffle files by the external shuffle service:
Does that mean partitioning does not work in Python? Or does this only
effect joining?
On 2015-02-12 19:27, Davies Liu wrote:
The feature works as expected in Scala/Java, but not implemented in
Python.
On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid iras...@cloudera.com
wrote:
I wonder if
Thank you for clarification, Sean.
2015-02-13 14:16 GMT+04:00 Sean Owen so...@cloudera.com:
It is a wrapper whose API is logically the same, but whose method
signatures make more sense in Java. You can call the Scala API in Java
without too much trouble, but it gets messy when you have to
Thanks for the reply. I am trying to set up a streaming-as-a-service
approach, using the framework that is used for spark-jobserver. For that I
would need to handle asynchronous operations that are initiated from
outside of the stream. Do you think it is not possible?
On Fri Feb 13 2015 at
Sure it's possible, but you would use Streaming to update some shared
state, and create another service that accessed that shared state too.
On Fri, Feb 13, 2015 at 11:57 AM, Tamas Jambor jambo...@gmail.com wrote:
Thanks for the reply. I am trying to set up a streaming-as-a-service
approach,
In
https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38,
wouldn't it help to change the lines
vs = rdd.map(lambda (k, v): (k, (1, v)))
ws = other.map(lambda (k, v): (k, (2, v)))
to
vs = rdd.mapValues(lambda v: (1, v))
ws = other.mapValues(lambda v: (2, v))
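For intuition, mapValues is a value-only transform: the keys pass through untouched, which is what would let Spark keep the existing partitioner. A plain-Python sketch of the equivalence (not the pyspark implementation itself):

```python
def map_values(pairs, f):
    # keys are unchanged, so any partitioning by key would stay valid;
    # this mirrors what rdd.mapValues promises in Spark
    return [(k, f(v)) for (k, v) in pairs]

rdd = [("k1", "a"), ("k2", "b")]
vs = map_values(rdd, lambda v: (1, v))
ws = map_values(rdd, lambda v: (2, v))

assert vs == [("k1", (1, "a")), ("k2", (1, "b"))]
assert ws == [("k1", (2, "a")), ("k2", (2, "b"))]
```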
Hi,
can someone explain how to trap a SQLException for a query executed using a
SchemaRDD? I mean, suppose the table is not found.
thanks in advance,
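One generic pattern (not a confirmed Spark API; the execute callable here is a stand-in for whatever actually runs the query) is to wrap the call and trap the failure yourself:

```python
def run_query(execute, query):
    """Run a query via the supplied callable, trapping any failure."""
    try:
        return True, execute(query)
    except Exception as e:   # e.g. a "table not found" error surfaces here
        return False, str(e)

# illustrative stand-in that fails the way a missing table might
def fake_execute(query):
    raise RuntimeError("Table Not Found: missing_table")

ok, detail = run_query(fake_execute, "SELECT * FROM missing_table")
assert not ok and "Table Not Found" in detail
```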
Use the configuration below if you are using version 1.2:
SET spark.shuffle.consolidateFiles=true;
SET spark.rdd.compress=true;
SET spark.default.parallelism=1000;
SET spark.deploy.defaultCores=54;
Thanks
Puneet.
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Friday,
Hi all,
I'd like to build/use column oriented RDDs in some of my Spark code. A
normal Spark RDD is stored as row oriented object if I understand
correctly.
I'd like to leverage some of the advantages of a columnar memory format.
Shark (used to) and SparkSQL uses a columnar storage format using
You can also export the variable SPARK_PRINT_LAUNCH_COMMAND before launching a
spark-submit command to display the java command that will be launched, e.g.:
export SPARK_PRINT_LAUNCH_COMMAND=1
/opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --class
kelkoo.SparkAppTemplate --jars
Hello,
I was trying the streaming kmeans clustering example in the official
documentation at:
http://spark.apache.org/docs/1.2.0/mllib-clustering.html
But I've got a type error when I tried to compile the code:
[error] found :
Except that transformations don't have an exactly-once guarantee, so this
way of doing counters may produce different answers across various forms of
failures and speculative execution.
On Fri, Feb 13, 2015 at 8:56 AM, Sean McNamara sean.mcnam...@webtrends.com
wrote:
.map is just a
Agree, it's correct in the linear methods doc page too. You can open a
PR for this simple 'typo' fix; I imagine it just wasn't updated or
fixed with the others.
On Fri, Feb 13, 2015 at 4:26 PM, Emre Sevinc emre.sev...@gmail.com wrote:
Hello,
I was trying the streaming kmeans clustering example
I am trying to implement counters in Spark and I guess Accumulators are the
way to do it.
My motive is to update a counter in a map function and access/reset it in
the driver code. However the println statement at the end still yields
value 0 (it should be 9). Am I doing something wrong?
def
.map is just a transformation, so no work will actually be performed until
something takes action against it. Try adding a .count(), like so:
inputRDD.map { x =>
  counter += 1
}.count()
In case it is helpful, here are the docs on what exactly the transformations
and actions are:
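As an analogy (plain Python, not Spark), a lazy iterator shows the same behavior: nothing runs until something consumes it, just as no work happens until an action like .count():

```python
counter = 0

def bump(x):
    global counter
    counter += 1
    return x

mapped = map(bump, range(9))   # lazy, like an RDD transformation
assert counter == 0            # nothing has executed yet

list(mapped)                   # consuming it plays the role of .count()
assert counter == 9
```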
I'm playing around with Spark on Windows and have two worker nodes running by
starting them manually using a script that contains the following
set SPARK_HOME=C:\dev\programs\spark-1.2.0
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker
Hi,
I believe SizeOf.jar may calculate the wrong size for you.
Spark has a util called SizeEstimator located in
org.apache.spark.util.SizeEstimator. And someone extracted it out in
I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.
rdd1: (id1, Long text 1), (id2, Long text 2), (id3, Long text 3)
rdd2: (id1, Long text 1 A), (id2, Long text 2 A)
rdd3: (id1, Long text 1 B)
Then, I want to merge all RDDs. If there is duplicated docids, later
I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the
following exception when the size of the broadcast variable exceeds 2GB. Any
ideas on how I can resolve this issue?
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at
reducebyKey should work, but you need to define the ordering by using some
sort of index.
On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
I have multiple RDD[(String, String)] that store (docId, docText) pairs,
e.g.
rdd1: (“id1”, “Long text
I think you've hit the nail on the head. Since the serialization
ultimately creates a byte array, and arrays can have at most ~2
billion elements in the JVM, the broadcast can be at most ~2GB.
At that scale, you might consider whether you really have to broadcast
these values, or want to handle
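The arithmetic behind the limit: a JVM byte[] is indexed by a 32-bit int, so a serialized broadcast must fit under Integer.MAX_VALUE bytes:

```python
INT_MAX = 2**31 - 1            # Java's Integer.MAX_VALUE = 2147483647
cap_gb = INT_MAX / 1024**3
assert 1.99 < cap_gb < 2.0     # the ~2 GB ceiling on one byte array

five_gb = 5 * 1024**3
assert five_gb > INT_MAX       # why a 5 GB broadcast variable fails
```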
yeah I thought the same thing at first too, I suggested something
equivalent w/ preservesPartitioning = true, but that isn't enough. the
join is done by union-ing the two transformed rdds, which is very different
from the way it works under the hood in scala to enable narrow
dependencies. It
unfortunately this is a known issue:
https://issues.apache.org/jira/browse/SPARK-1476
as Sean suggested, you need to think of some other way of doing the same
thing, even if it's just breaking your one big broadcast var into a few
smaller ones
On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen
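A sketch of the splitting idea mentioned above (illustrative only): chop the serialized payload into pieces that each stay under the array cap, and broadcast each piece separately:

```python
def split_into_chunks(data, max_bytes):
    # each piece stays below the ~2 GB byte[] limit;
    # each chunk could then be broadcast on its own
    return [data[i:i + max_bytes] for i in range(0, len(data), max_bytes)]

payload = b"x" * 10
parts = split_into_chunks(payload, 4)

assert [len(p) for p in parts] == [4, 4, 2]
assert b"".join(parts) == payload   # lossless round trip
```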
Try using `backticks` to escape non-standard characters.
On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote:
I don't remember Oracle ever enforcing that I couldn't include a $ in a
column name, but I also don't think I've ever tried.
When using sqlContext.sql(...), I
Hi guys,
Probably a dummy question. Do you know how to compile Spark 0.9 to easily
integrate with HDFS 2.6.0 ?
I was trying
sbt/sbt -Pyarn -Phadoop-2.6 assembly
or mvn -Dhadoop.version=2.6.0 -DskipTests clean package
but none of these approaches succeeded.
Thanks, Robert
I don't remember Oracle ever enforcing that I couldn't include a $ in a
column name, but I also don't think I've ever tried.
When using sqlContext.sql(...), I have a SELECT * from myTable WHERE
locations_$homeAddress = '123 Elm St'
It's telling me $ is invalid. Is this a bug?
Thanks Michael for the pointer Sorry for the delayed reply.
Taking a quick inventory of scope of change - Is the column type for
Decimal caching needed only in the caching layer (4 files
in org.apache.spark.sql.columnar - ColumnAccessor.scala,
ColumnBuilder.scala, ColumnStats.scala,
Shark's in-memory code was ported to Spark SQL and is used by default when
you run .cache on a SchemaRDD or CACHE TABLE.
I'd also look at parquet which is more efficient and handles nested data
better.
On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf nightwolf...@gmail.com wrote:
Hi all,
I'd like
I've updated to Spark 1.2.0 on EC2, and the persistent-hdfs behaviour
appears to have changed.
My launch script is
spark-1.2.0-bin-hadoop2.4/ec2/spark-ec2 --instance-type=m3.xlarge -s 5
--ebs-vol-size=1000 launch myproject
When I ssh into master I get:
$ df -h
Filesystem            Size
I am trying to run BlinkDB(https://github.com/sameeragarwal/blinkdb) which
seems to work only with Spark 0.9. However, if I want to access HDFS I need to
compile Spark against Hadoop version which is running on my cluster(2.6.0).
Hence, the versions problem ...
On Friday, February 13,
You cannot have two Spark Contexts in the same JVM active at the same time.
Just create one SparkContext and then use it for both purposes.
TD
On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN
johnfedrickena...@gmail.com wrote:
Can you try creating just a single spark context and then try
this is more-or-less the best you can do now, but as has been pointed out,
accumulators don't quite fit the bill for counters. There is an open issue
to do something better, but no progress on that so far
https://issues.apache.org/jira/browse/SPARK-603
On Fri, Feb 13, 2015 at 11:12 AM, Mark
If you just need standalone mode, you don't need -Pyarn. There is no
-Phadoop-2.6; you should use -Phadoop-2.4 for 2.4+. Yes, set
-Dhadoop.version=2.6.0. That should be it.
If that still doesn't work, define "doesn't succeed".
On Fri, Feb 13, 2015 at 7:13 PM, Grandl Robert
Here is an example of how you can do it. Let's say myDStream contains the
data that you may want to asynchronously query, say using Spark SQL.
val sqlContext = new SQLContext(streamingContext.sparkContext)
myDStream.foreachRDD { rdd => // rdd is an RDD of case class
OK, from scanning the pom.xml, I think you would try:
-Pyarn -Dhadoop.version=2.6.0
If it doesn't package or pass tests, then I'd assume it's not supported :(
On Fri, Feb 13, 2015 at 7:33 PM, Grandl Robert rgra...@yahoo.com wrote:
I am trying to run
Looks like this is caused by issue SPARK-5008:
https://issues.apache.org/jira/browse/SPARK-5008
On 13 February 2015 at 19:04, Joe Wass jw...@crossref.org wrote:
I've updated to Spark 1.2.0 on EC2, and the persistent-hdfs behaviour
appears to have changed.
My launch script is
Do you mean first union all RDDs together and then do a reduceByKey()? Suppose
my unioned RDD is
rdd : (“id1”, “text 1”), (“id1”, “text 2”), (“id1”, “text 3”)
How can I use reduceByKey to return (“id1”, “text 3”)? I mean, take the
last entry for the same key.
Code snippet is
Thanks Sean for your prompt response.
I was trying to compile as following:
mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
but I got a bunch of errors (see below). Hadoop-2.6.0 compiled correctly, and
all hadoop jars are in .m2 repository.
Do you have any idea what might
Oh right, you said Spark 0.9. Those profiles won't exist back then. I
don't even know if Hadoop 2.6 will work with 0.9 as-is. The profiles
were introduced later to fix up some compatibility. Why not use 1.2.1?
On Fri, Feb 13, 2015 at 7:26 PM, Grandl Robert rgra...@yahoo.com wrote:
Thanks Sean
Thanks Sean and Imran,
I'll try splitting the broadcast variable into smaller ones.
I had tried a regular join but it was failing due to high garbage
collection overhead during the shuffle. One of the RDDs is very large
and has a skewed distribution where a handful of keys account for 90%
of the
Hello,
I was looking at GraphX as I believe it can be useful in my research on
temporal data and I had a number of questions about the system:
1) How do you actually run programs in GraphX? At the moment I've been doing
everything live through the shell, but I'd obviously like to be able to
Thanks Akhil for the suggestion, it is now only giving me one part - .
Is there any way I can just create a file rather than a directory? It
doesn't seem like there is a saveAsTextFile option for
JavaPairReceiverDStream.
Also, for the copy/merge api, how would I add that to my spark job?
This is fixed in 1.2.1, could you upgrade to 1.2.1?
On Thu, Feb 12, 2015 at 4:55 AM, Rok Roskar rokros...@gmail.com wrote:
Hi again,
I narrowed down the issue a bit more -- it seems to have to do with the Kryo
serializer. When I use it, then this results in a Null Pointer:
rdd =
I wrote a proof of concept to automatically store any RDD of tuples or case
classes in columnar format using arrays (and strongly typed, so you get the
benefit of primitive arrays). See:
https://github.com/tresata/spark-columnar
On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust
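The row-to-column idea in miniature (plain Python, just illustrating what the linked project does with typed arrays in Scala):

```python
# row-oriented: one tuple per record
rows = [(1, 0.5), (2, 0.25), (3, 0.125)]

# column-oriented: one array per field (in Scala these could be
# primitive arrays, avoiding per-element boxing)
ids = [r[0] for r in rows]
scores = [r[1] for r in rows]

# any row can be reconstructed on demand from the columns
assert (ids[1], scores[1]) == rows[1]
```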
Per the documentation:
It is important to make sure that the structure of every Row of the
provided RDD matches the provided schema. Otherwise, there will be runtime
exception.
However, it appears that this is not being enforced.
import org.apache.spark.sql._
val sqlContext = new
I have not run the following, but will be on these lines -
rdd.zipWithIndex().map(x => (x._1._1, (x._1._2, x._2))).reduceByKey((a, b)
=> { if (a._2 > b._2) a else b }).map(x => (x._1, x._2._1))
On Fri, Feb 13, 2015 at 3:27 PM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Do you mean
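The same last-entry-wins logic in plain Python, illustrating what a zipWithIndex/reduceByKey chain like the one suggested above computes:

```python
pairs = [("id1", "text 1"), ("id1", "text 2"), ("id1", "text 3")]

best = {}
for index, (key, value) in enumerate(pairs):
    # a later index always wins, mirroring reduceByKey keeping the
    # element with the larger zipWithIndex position
    if key not in best or index > best[key][1]:
        best[key] = (value, index)

result = {k: v for k, (v, _) in best.items()}
assert result == {"id1": "text 3"}
```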
Hi Justin,
It is expected. We do not check if the provided schema matches rows since
all rows need to be scanned to give a correct answer.
Thanks,
Yin
On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony justin.pih...@gmail.com
wrote:
Per the documentation:
It is important to make sure that
OK, but what about on an action, like collect()? Shouldn't it be able to
determine the correctness at that time?
On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai yh...@databricks.com wrote:
Hi Justin,
It is expected. We do not check if the provided schema matches rows since
all rows need to be
This doesn't seem to have helped.
On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust mich...@databricks.com
wrote:
Try using `backticks` to escape non-standard characters.
On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet cjno...@gmail.com wrote:
I don't remember Oracle ever enforcing that I
Nevermind - I think I may have had a schema-related issue (sometimes
booleans were represented as strings and sometimes as raw booleans, but when
I populated the schema one or the other was chosen).
On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet cjno...@gmail.com wrote:
Here are the results of a
I'm having the same problem with the same sample code. Any progress on this?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkException-Task-not-serializable-Jackson-Json-tp21347p21651.html
Sent from the Apache Spark User List mailing list archive at
Here are the results of a few different SQL strings (let's assume the
schemas are valid for the data types used):
SELECT * from myTable where key1 = true - no filters are pushed to my
PrunedFilteredScan
SELECT * from myTable where key1 = true and key2 = 5 - 1 filter (key2) is
pushed to my
At 2015-02-13 12:19:46 -0800, Matthew Bucci mrbucci...@gmail.com wrote:
1) How do you actually run programs in GraphX? At the moment I've been doing
everything live through the shell, but I'd obviously like to be able to work
on it by writing and running scripts.
You can create your own
Thanks for Ye Xianjin's suggestions.
The SizeOf.jar may indeed have some problems. I did a simple test as
follows. The codes are
val n = 1; //5; //10; //100; //1000;
val arr1 = new Array[(Int, Array[Int])](n);
for (i <- 0 until arr1.length) {
arr1(i) = (i, new