At 2015-02-13 12:19:46 -0800, Matthew Bucci wrote:
> 1) How do you actually run programs in GraphX? At the moment I've been doing
> everything live through the shell, but I'd obviously like to be able to work
> on it by writing and running scripts.
You can create your own projects that build against Spark as a dependency.
Thanks for Ye Xianjin's suggestions.
The SizeOf.jar may indeed have some problems. I did a simple test as
follows. The code is:
val n = 1 // also tried 5, 10, 100, 1000
val arr1 = new Array[(Int, Array[Int])](n)
for (i <- 0 until arr1.length) {
  arr1(i) = (i, new Array[Int](43))
}
Never mind; I think I may have had a schema-related issue (sometimes
booleans were represented as strings and sometimes as raw booleans, but when
I populated the schema one or the other was chosen).
On Fri, Feb 13, 2015 at 8:03 PM, Corey Nolet wrote:
> Here are the results of a few different SQL strings (let's assume the
Here are the results of a few different SQL strings (let's assume the
schemas are valid for the data types used):
SELECT * from myTable where key1 = true -> no filters are pushed to my
PrunedFilteredScan
SELECT * from myTable where key1 = true and key2 = 5 -> 1 filter (key2) is
pushed to my PrunedFilteredScan
I'm having the same problem with the same sample code. Any progress on this?
This is fixed in 1.2.1. Could you upgrade to 1.2.1?
On Thu, Feb 12, 2015 at 4:55 AM, Rok Roskar wrote:
> Hi again,
>
> I narrowed down the issue a bit more -- it seems to have to do with the Kryo
> serializer. When I use it, this results in a NullPointerException:
>
> rdd = sc.parallelize(range(10))
This doesn't seem to have helped.
On Fri, Feb 13, 2015 at 2:51 PM, Michael Armbrust
wrote:
> Try using `backticks` to escape non-standard characters.
>
> On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet wrote:
>
>> I don't remember Oracle ever enforcing that I couldn't include a $ in a
>> column name, but I also don't think I've ever tried.
OK, but what about on an action, like collect()? Shouldn't it be able to
determine the correctness at that time?
On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai wrote:
> Hi Justin,
>
> It is expected. We do not check if the provided schema matches rows since
> all rows need to be scanned to give a correct answer.
Hi Justin,
It is expected. We do not check if the provided schema matches rows since
all rows need to be scanned to give a correct answer.
Thanks,
Yin
On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony
wrote:
> Per the documentation:
>
> It is important to make sure that the structure of every
I have not run the following, but it will be along these lines:
rdd.zipWithIndex().map(x => (x._1._1, (x._1._2, x._2))).reduceByKey((a, b)
=> { if (a._2 > b._2) a else b }).map(x => (x._1, x._2._1))
On Fri, Feb 13, 2015 at 3:27 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
> Do you mean to first union all the RDDs together and then do a reduceByKey()?
Per the documentation:
It is important to make sure that the structure of every Row of the
provided RDD matches the provided schema. Otherwise, there will be runtime
exception.
However, it appears that this is not being enforced.
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
I wrote a proof of concept to automatically store any RDD of tuples or case
classes in columnar format using arrays (and strongly typed, so you get the
benefit of primitive arrays). See:
https://github.com/tresata/spark-columnar
On Fri, Feb 13, 2015 at 3:06 PM, Michael Armbrust
wrote:
> Shark's in-memory code was ported to Spark SQL and is used by default when
Thanks Akhil for the suggestion; it is now only giving me one part- file.
Is there any way I can just create a file rather than a directory? There
doesn't seem to be a saveAsTextFile option for
JavaPairReceiverInputDStream.
Also, for the copy/merge API, how would I add that to my Spark job?
You cannot have two SparkContexts active in the same JVM at the same time.
Just create one SparkContext and then use it for both purposes.
TD
On Fri, Feb 6, 2015 at 8:49 PM, VISHNU SUBRAMANIAN <
johnfedrickena...@gmail.com> wrote:
> Can you try creating just a single spark context and then try
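A minimal sketch of that suggestion (names are illustrative, not from the
thread): one SparkContext created up front and shared by the batch and
streaming sides.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sc = new SparkContext(new SparkConf().setAppName("shared-context"))
val ssc = new StreamingContext(sc, Seconds(10)) // reuses the one SparkContext
// batch jobs keep using sc while ssc drives the streaming side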
Do you mean to first union all the RDDs together and then do a reduceByKey()?
Suppose my unioned RDD is
rdd: ("id1", "text 1"), ("id1", "text 2"), ("id1", "text 3")
How can I use reduceByKey to return ("id1", "text 3")? I mean, take the
last entry for each key.
A code snippet is appreciated.
Hello,
I was looking at GraphX as I believe it can be useful in my research on
temporal data and I had a number of questions about the system:
1) How do you actually run programs in GraphX? At the moment I've been doing
everything live through the shell, but I'd obviously like to be able to work
on it by writing and running scripts.
Shark's in-memory code was ported to Spark SQL and is used by default when
you run .cache on a SchemaRDD or CACHE TABLE.
I'd also look at parquet which is more efficient and handles nested data
better.
On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf wrote:
> Hi all,
>
> I'd like to build/use column-oriented RDDs in some of my Spark code. A
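For reference, a minimal sketch of the caching paths mentioned above, using
the Spark 1.2-era API (the table name is made up):

val schemaRDD = sqlContext.sql("SELECT * FROM myTable")
schemaRDD.cache()                     // in-memory columnar caching of a SchemaRDD
sqlContext.cacheTable("myTable")      // or cache a registered table by name
sqlContext.sql("CACHE TABLE myTable") // SQL equivalent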
Looks like this is caused by issue SPARK-5008:
https://issues.apache.org/jira/browse/SPARK-5008
On 13 February 2015 at 19:04, Joe Wass wrote:
> I've updated to Spark 1.2.0 and the EC2 and the persistent-hdfs behaviour
> appears to have changed.
>
> My launch script is
>
> spark-1.2.0-bin-hadoop2.4/ec2/spark-ec2 --instance-type=m3.xlarge -s 5
Try using `backticks` to escape non-standard characters.
On Fri, Feb 13, 2015 at 11:30 AM, Corey Nolet wrote:
> I don't remember Oracle ever enforcing that I couldn't include a $ in a
> column name, but I also don't think I've ever tried.
>
> When using sqlContext.sql(...), I have a "SELECT * from myTable WHERE
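A minimal sketch of the backtick escaping suggested above, applied to the
query from the message quoted here:

sqlContext.sql("SELECT * FROM myTable WHERE `locations_$homeAddress` = '123 Elm St'")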
OK, from scanning the pom.xml, I think you would try:
-Pyarn -Dhadoop.version=2.6.0
If it doesn't package or pass tests, then I'd assume it's not supported :(
On Fri, Feb 13, 2015 at 7:33 PM, Grandl Robert wrote:
> I am trying to run BlinkDB(https://github.com/sameeragarwal/blinkdb) which
> seems to work only with Spark 0.9. However, if I want to access HDFS I need to
Thanks Michael for the pointer, and sorry for the delayed reply.
Taking a quick inventory of the scope of change: is the column type for
Decimal caching needed only in the caching layer (4 files
in org.apache.spark.sql.columnar: ColumnAccessor.scala,
ColumnBuilder.scala, ColumnStats.scala, ColumnType.scala)?
I am trying to run BlinkDB(https://github.com/sameeragarwal/blinkdb) which
seems to work only with Spark 0.9. However, if I want to access HDFS I need to
compile Spark against the Hadoop version running on my cluster (2.6.0).
Hence, the versions problem ...
On Friday, February 13, 2015,
Thanks Sean and Imran,
I'll try splitting the broadcast variable into smaller ones.
I had tried a regular join but it was failing due to high garbage
collection overhead during the shuffle. One of the RDDs is very large
and has a skewed distribution where a handful of keys account for 90%
of the
I don't remember Oracle ever enforcing that I couldn't include a $ in a
column name, but I also don't think I've ever tried.
When using sqlContext.sql(...), I have a "SELECT * from myTable WHERE
locations_$homeAddress = '123 Elm St'"
It's telling me $ is invalid. Is this a bug?
Here is an example of how you can do it. Let's say "myDStream" contains the
data that you may want to asynchronously query, say using, Spark SQL.
val sqlContext = new SQLContext(streamingContext.sparkContext)
myDStream.foreachRDD { rdd => // rdd is an RDD of a case class
  sqlContext.registerRDDAsTable(
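A fleshed-out sketch of the same pattern (Spark 1.2-era API; the case class,
table name, and query are made up for illustration):

import org.apache.spark.sql.SQLContext

case class Record(id: String, value: Int)

val sqlContext = new SQLContext(streamingContext.sparkContext)
import sqlContext.createSchemaRDD // implicit RDD[Record] -> SchemaRDD

myDStream.foreachRDD { rdd =>
  rdd.registerTempTable("records") // re-register the latest batch
}
streamingContext.start()
// later, from any other thread:
val counts = sqlContext.sql("SELECT count(*) FROM records").collect()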
Thanks Sean for your prompt response.
I was trying to compile as follows:
mvn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package
but I got a bunch of errors (see below). Hadoop 2.6.0 compiled correctly, and
all the Hadoop jars are in the .m2 repository.
Do you have any idea what might have happened?
Oh right, you said Spark 0.9. Those profiles didn't exist back then. I
don't even know if Hadoop 2.6 will work with 0.9 as-is. The profiles
were introduced later to fix up some compatibility issues. Why not use 1.2.1?
On Fri, Feb 13, 2015 at 7:26 PM, Grandl Robert wrote:
> Thanks Sean for your prompt response.
If you just need standalone mode, you don't need "-Pyarn". There is no
"-Phadoop-2.6"; you should use "-Phadoop-2.4" for 2.4+. Yes, set
-Dhadoop.version=2.6.0. That should be it.
If that still doesn't work, define "doesn't succeed".
On Fri, Feb 13, 2015 at 7:13 PM, Grandl Robert
wrote:
> Hi guys
Hi guys,
Probably a dummy question. Do you know how to compile Spark 0.9 to easily
integrate with HDFS 2.6.0 ?
I was trying
sbt/sbt -Pyarn -Phadoop-2.6 assembly
or
mvn -Dhadoop.version=2.6.0 -DskipTests clean package
but neither of these approaches succeeded.
Thanks, Robert
I've updated to Spark 1.2.0 and the EC2 and the persistent-hdfs behaviour
appears to have changed.
My launch script is
spark-1.2.0-bin-hadoop2.4/ec2/spark-ec2 --instance-type=m3.xlarge -s 5
--ebs-vol-size=1000 launch myproject
When I ssh into master I get:
$ df -h
Filesystem      Size  Us
This is more-or-less the best you can do for now, but as has been pointed out,
accumulators don't quite fit the bill for counters. There is an open issue
to do something better, but no progress on that so far
https://issues.apache.org/jira/browse/SPARK-603
On Fri, Feb 13, 2015 at 11:12 AM, Mark Hams
Unfortunately this is a known issue:
https://issues.apache.org/jira/browse/SPARK-1476
As Sean suggested, you need to think of some other way of doing the same
thing, even if it's just breaking your one big broadcast variable into a few
smaller ones.
On Fri, Feb 13, 2015 at 12:30 PM, Sean Owen wrote:
>
Yeah, I thought the same thing at first too, and I suggested something
equivalent with preservesPartitioning = true, but that isn't enough. The
join is done by union-ing the two transformed RDDs, which is very different
from the way it works under the hood in Scala to enable narrow
dependencies. It real
reduceByKey should work, but you need to define the ordering by using some
sort of index.
On Fri, Feb 13, 2015 at 12:38 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:
>
>
> I have multiple RDD[(String, String)] that store (docId, docText) pairs,
> e.g.
>
>
>
> rdd1: ("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")
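A minimal sketch of that index idea, assuming the unioned RDD preserves the
order in which the inputs were concatenated (pair-RDD operations may need
import org.apache.spark.SparkContext._ outside the shell on pre-1.3 Spark):

val tagged = rdd.zipWithIndex().map { case ((k, v), i) => (k, (v, i)) }
val lastPerKey = tagged
  .reduceByKey((a, b) => if (a._2 > b._2) a else b) // keep the later entry
  .mapValues(_._1)                                  // drop the index again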
I think you've hit the nail on the head. Since the serialization
ultimately creates a byte array, and arrays can have at most ~2
billion elements in the JVM, the broadcast can be at most ~2GB.
At that scale, you might consider whether you really have to broadcast
these values, or want to handle th
I am trying to broadcast a large 5GB variable using Spark 1.2.0. I get the
following exception when the size of the broadcast variable exceeds 2GB. Any
ideas on how I can resolve this issue?
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map
I have multiple RDD[(String, String)] that store (docId, docText) pairs, e.g.
rdd1: ("id1", "Long text 1"), ("id2", "Long text 2"), ("id3", "Long text 3")
rdd2: ("id1", "Long text 1 A"), ("id2", "Long text 2 A")
rdd3: ("id1", "Long text 1 B")
Then, I want to merge all the RDDs. If there is a duplicated docId, I want to
keep only the last entry.
Except that transformations don't have an exactly-once guarantee, so this
way of doing counters may produce different answers across various forms of
failures and speculative execution.
On Fri, Feb 13, 2015 at 8:56 AM, Sean McNamara
wrote:
> .map is just a transformation, so no work will actual
I'm playing around with Spark on Windows and have two worker nodes running by
starting them manually using a script that contains the following:
set SPARK_HOME=C:\dev\programs\spark-1.2.0
set SPARK_MASTER_IP=master.brad.com
spark-class org.apache.spark.deploy.worker.Worker
spark://master.brad.com:70
.map is just a transformation, so no work will actually be performed until
something takes action against it. Try adding a .count(), like so:
inputRDD.map { x =>
  counter += 1
}.count()
In case it is helpful, here are the docs on what exactly the transformations
and actions are:
htt
I am trying to implement counters in Spark and I guess Accumulators are the
way to do it.
My motive is to update a counter in a map function and access/reset it in the
driver code. However, the println statement at the end still yields the value
0 (it should be 9). Am I doing something wrong?
def main(arg
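A minimal sketch of the accumulator pattern under discussion; note that an
action must run before the driver can see the updates:

val counter = sc.accumulator(0)
sc.parallelize(1 to 9).foreach(_ => counter += 1) // foreach is an action
println(counter.value) // 9; .value is only meaningful on the driver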
Agree, it's correct in the linear methods doc page too. You can open a
PR for this simple 'typo' fix; I imagine it just wasn't updated or
fixed with the others.
On Fri, Feb 13, 2015 at 4:26 PM, Emre Sevinc wrote:
> Hello,
>
> I was trying the streaming kmeans clustering example in the official
>
Hello,
I was trying the streaming kmeans clustering example in the official
documentation at:
http://spark.apache.org/docs/1.2.0/mllib-clustering.html
But I've got a type error when I tried to compile the code:
[error] found :
org.apache.spark.streaming.dstream.DStream[org.apache.spark.ml
You can also export the variable SPARK_PRINT_LAUNCH_COMMAND before launching a
spark-submit command to display the java command that will be launched, e.g.:
export SPARK_PRINT_LAUNCH_COMMAND=1
/opt/spark/bin/spark-submit --master yarn --deploy-mode cluster --class
kelkoo.SparkAppTemplate --jars
Hi all,
I'd like to build/use column-oriented RDDs in some of my Spark code. A
normal Spark RDD is stored as a row-oriented object, if I understand
correctly.
I'd like to leverage some of the advantages of a columnar memory format.
Shark (used to) and Spark SQL use a columnar storage format using pr
Hi,
I believe SizeOf.jar may calculate the wrong size for you.
Spark has a util called SizeEstimator, located in
org.apache.spark.util.SizeEstimator. And someone extracted it out in
https://github.com/phatak-dev/java-sizeof/blob/master/src/main/scala/com/madhukaraphatak/sizeof/SizeEstimator.scal
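A hedged example of using the extracted class linked above, assuming its API
mirrors Spark's SizeEstimator.estimate:

import com.madhukaraphatak.sizeof.SizeEstimator

val arr = new Array[(Int, Array[Int])](100)
for (i <- arr.indices) arr(i) = (i, new Array[Int](43))
println(SizeEstimator.estimate(arr)) // estimated deep size in bytes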
Hi,
Can someone show how to trap the SQL exception for a query executed using
SchemaRDD? I mean, for example, when a table is not found.
Thanks in advance,
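A minimal sketch, assuming the analyzer reports a missing table as a
RuntimeException at query time (the table name is made up):

try {
  sqlContext.sql("SELECT * FROM noSuchTable").collect()
} catch {
  case e: RuntimeException => println("Query failed: " + e.getMessage)
}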
In
https://github.com/apache/spark/blob/master/python/pyspark/join.py#L38,
wouldn't it help to change the lines
vs = rdd.map(lambda (k, v): (k, (1, v)))
ws = other.map(lambda (k, v): (k, (2, v)))
to
vs = rdd.mapValues(lambda v: (1, v))
ws = other.mapValues(lambda v: (2, v))
?
Use the below configuration if you are using version 1.2:
SET spark.shuffle.consolidateFiles=true;
SET spark.rdd.compress=true;
SET spark.default.parallelism=1000;
SET spark.deploy.defaultCores=54;
Thanks
Puneet.
-Original Message-
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Friday, February 13, 2015
Thanks for the reply. I am trying to set up a streaming-as-a-service
approach, using the framework that is used for spark-jobserver. For that I
would need to handle asynchronous operations that are initiated from
outside of the stream. Do you think it is not possible?
On Fri Feb 13 2015 at 10:14:1
Sure it's possible, but you would use Streaming to update some shared
state, and create another service that accessed that shared state too.
On Fri, Feb 13, 2015 at 11:57 AM, Tamas Jambor wrote:
> Thanks for the reply, I am trying to setup a streaming as a service
> approach, using the framework
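A minimal sketch of that shared-state approach (myDStream and ssc are assumed
to already exist; the served value here is just a batch count):

import java.util.concurrent.atomic.AtomicLong

val lastBatchCount = new AtomicLong(0)
myDStream.foreachRDD { rdd => lastBatchCount.set(rdd.count()) }
ssc.start()
// any other thread (e.g. an HTTP handler) can read lastBatchCount.get()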
Does that mean partitioning does not work in Python? Or does this only
affect joining?
On 2015-02-12 19:27, Davies Liu wrote:
The feature works as expected in Scala/Java, but not implemented in
Python.
On Thu, Feb 12, 2015 at 9:24 AM, Imran Rashid
wrote:
I wonder if the issue is that these
Thank you for clarification, Sean.
2015-02-13 14:16 GMT+04:00 Sean Owen :
> It is a wrapper whose API is logically the same, but whose method
> signatures make more sense in Java. You can call the Scala API in Java
> without too much trouble, but it gets messy when you have to manually
> grapple with ClassTag from Java for example.
18 cores or 36? It probably doesn't matter.
For this case, where you have some overhead per partition for setting up
the DB connection, it may indeed not help to chop up the data more
finely than your total parallelism. Although that would imply quite an
overhead. Are you doing any other expensive initi
Hello,
In Spark programming guide
(http://spark.apache.org/docs/1.2.0/programming-guide.html) there is a
recommendation:
Typically you want 2-4 partitions for each CPU in your cluster.
We have a Spark Master and two Spark workers each with 18 cores and 18 GB of
RAM.
In our application we use Jdbc
Frankly, there is no good/standard way to visualize streaming data. So far I
have found HBase to be a good intermediate store for data from streams, and
visualize it with a Play-based framework and d3.js.
Regards
Mayur
On Fri Feb 13 2015 at 4:22:58 PM Kevin (Sangwoo) Kim
wrote:
> I'm not very sure for CDH 5.3,
I'm not very sure about CDH 5.3,
but Zeppelin now works for Spark 1.2, as spark-repl has been published in
Spark 1.2.1.
Please try again!
On Fri Feb 13 2015 at 3:55:19 PM Su She wrote:
> Thanks Kevin for the link, I have had issues trying to install zeppelin as
> I believe it is not yet supported fo
It is a wrapper whose API is logically the same, but whose method
signatures make more sense in Java. You can call the Scala API in Java
without too much trouble, but it gets messy when you have to manually
grapple with ClassTag from Java for example.
There is not an implicit conversion since it is
You call awaitTermination() in the main thread, and indeed it blocks
there forever. From there Spark Streaming takes over, and is invoking
the operations you set up. Your operations have access to the data of
course. That's the model; you don't make external threads that reach
into Spark Streaming.
A number of comments:
310GB is probably too large for an executor. You probably want many
smaller executors per machine. But this is not your problem.
You didn't say where the OutOfMemoryError occurred. Executor or driver?
Tuple2 is a Scala type, and a general type. It is appropriate for
general
Hi,
is Spark SQL + Parquet suitable to replicate a star schema?
Paolo Platter
AgileLab CTO
Hi all,
I am trying to come up with a workflow where I can query streams
asynchronously. The problem I have is that the ssc.awaitTermination() line
blocks the whole thread, so it is not straightforward to me whether it is
possible to get hold of objects from streams once they are started. Any
suggestions?
Hello,
I have the following scenario and was wondering if I can use Spark to
address it.
I want to query two different data stores (say, ElasticSearch and MySQL)
and then merge the two result sets based on a join key between the two. Is
it appropriate to use Spark to do this join, if the intermed
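A minimal sketch of such a merge, assuming each store's results have already
been loaded as key-value RDDs (the data is made up; real code would use the
Elasticsearch and JDBC connectors):

val esResults    = sc.parallelize(Seq(("k1", "es-doc-1"), ("k2", "es-doc-2")))
val mysqlResults = sc.parallelize(Seq(("k1", "row-1"), ("k3", "row-3")))
val joined = esResults.join(mysqlResults) // RDD[(String, (String, String))]
// keeps only keys present on both sides, e.g. ("k1", ("es-doc-1", "row-1"))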
Hello,
To convert existing Map Reduce jobs to Spark, I need to implement window
functions such as FIRST_VALUE, LEAD, LAG and so on. For example,
FIRST_VALUE function:
Source (1st column is key):
A, A1
A, A2
A, A3
B, B1
B, B2
C, C1
and the result should be
A, A1, A1
A, A2, A1
A, A3, A1
B, B1, B1
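A minimal sketch of a FIRST_VALUE-style computation over an
RDD[(String, String)], assuming the first value per key is also the smallest
in sort order, as in the example above:

val rows = sc.parallelize(Seq(
  ("A", "A1"), ("A", "A2"), ("A", "A3"),
  ("B", "B1"), ("B", "B2"), ("C", "C1")))

val withFirst = rows.groupByKey().flatMap { case (k, vs) =>
  val first = vs.min // smallest value stands in for "first"
  vs.map(v => (k, v, first))
}
// withFirst.collect() yields (A, A1, A1), (A, A2, A1), (A, A3, A1), (B, B1, B1), ...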
Get it. Thanks Reynold and Andrew!
Jianshi
On Thu, Feb 12, 2015 at 12:25 AM, Andrew Or wrote:
> Hi Jianshi,
>
> For YARN, there may be an issue with how a recent patch changes the
> accessibility of the shuffle files by the external shuffle service:
> https://issues.apache.org/jira/browse/SPA