No, never really resolved the problem, except by increasing the permgen
space, which only partially solved it. Still have to restart the job
multiple times to make the whole job complete (it stores intermediate
results).
The parquet data sources have about 70 columns, and yes Cheng, it works
fine
Hello,
I've set spark.deploy.spreadOut=false in spark-env.sh.
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=4 -Dspark.deploy.spreadOut=false"
There are 3 workers, each with 4 cores. Spark-shell was started with number of
cores = 6.
The Spark UI shows that one executor was used with 6 cores.
Is
Thanks!
I am using Spark Streaming 1.3, and if some post fails for any
reason, I will store the offset of that message in another Kafka topic. I
want to read these offsets in another Spark job and then fetch the original
Kafka topic's messages based on these offsets-
So is it possible
I am trying to run Spark applications with the driver running locally and
interacting with a firewalled remote cluster via a SOCKS proxy.
I have to modify the hadoop configuration on the *local machine* to try to
make this work, adding
property
Yes, look at KafkaUtils.createRDD
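A minimal sketch of what that can look like in Spark 1.3 (topic name, broker
address, and offsets below are placeholders, not from the thread):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// read exactly the messages between the stored offsets
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val offsetRanges = Array(OffsetRange("original-topic", 0, 100L, 200L))
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)   // RDD of (key, message) pairs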
On Wed, Jul 22, 2015 at 11:17 AM, Shushant Arora shushantaror...@gmail.com
wrote:
Thanks!
I am using Spark Streaming 1.3, and if some post fails for any
reason, I will store the offset of that message in another Kafka topic. I
want to read these
How many columns are there in these Parquet files? Could you load a
small portion of the original large dataset successfully?
Cheng
On 6/25/15 5:52 PM, Anders Arpteg wrote:
Yes, both the driver and the executors. Works a little bit better with
more space, but still a leak that will cause
Yeah, the benefit of `saveAsTable` is that you don't need to deal with
schema explicitly, while the benefit of ALTER TABLE is you still have a
standard vanilla Hive table.
Cheng
On 7/22/15 11:00 PM, Dean Wampler wrote:
While it's not recommended to overwrite files Hive thinks it
understands,
Hello all,
We are having a major performance issue with Spark, which is holding us
back from going live.
We have a job that carries out computation on log files and writes the
results into an Oracle DB.
The reducer 'reduceByKey' has been set to parallelize by 4, as we don't
want to establish too
I have a protected s3 bucket that requires a certain IAM role to access. When
I start my cluster using the spark-ec2 script, everything works just fine until
I try to read from that part of s3. Here is the command I am using:
./spark-ec2 -k KEY -i KEY_FILE.pem
Hi guys,
I noticed that too. Anders, can you confirm that it works on the Spark 1.5
snapshot? This is what I tried at the end. It seems to be a 1.4 issue.
Best Regards,
Jerry
On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote:
No, never really resolved the problem, except by
Hi Burak,
Looking at the source code, the intermediate RDDs used in ALS.train() are
persisted during the computation using intermediateRDDStorageLevel (default
value is StorageLevel.MEMORY_AND_DISK) - see
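For reference, a rough sketch of overriding that level through the public
setters on the mllib ALS builder (rank/iterations and the serialized level are
illustrative choices, not from the thread):

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK_SER)  // instead of the default MEMORY_AND_DISK
val model = als.run(ratings_train)   // ratings_train: RDD[Rating], as in the thread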
Hi Jonathan,
I believe calling persist with StorageLevel.NONE doesn't do anything.
That's why the unpersist has an if statement before it.
Could you give more information about your setup please? Number of cores,
memory, number of partitions of ratings_train?
Thanks,
Burak
On Wed, Jul 22, 2015
Hi,
We have a requirement wherein we need to keep RDDs in memory between Spark
batch processing runs that happen every hour. The idea here is to have RDDs
that hold active user sessions in memory between two jobs, so that once one
job's processing is done and another job is run after an hour the RDDs
To be unpersisted, the RDD must be persisted first. If it's set to NONE, then
it's not persisted, and as such does not need to be freed. Does that make
sense?
Thank you,
Ilya Ganelin
-Original Message-
From: Stahlman, Jonathan
Hi Andrew
I tried many different combinations, but still see no change in the amount of
shuffle bytes spilled to disk when checking the UI. I made sure the configuration
has been applied by checking Spark UI/Environment. I only see changes in
shuffle bytes spilled if I disable spark.shuffle.spill
For what it's worth, my data set has around 85 columns in Parquet format as
well. I have tried bumping the permgen up to 512m but I'm still getting
errors in the driver thread.
On Wed, Jul 22, 2015 at 1:20 PM, Jerry Lam chiling...@gmail.com wrote:
Hi guys,
I noticed that too. Anders, can you
Hi, Andrew,
If I broadcast the Map:
val map2=sc.broadcast(map1)
I will get compilation error:
org.apache.spark.broadcast.Broadcast[scala.collection.immutable.Map[Int,String]]
does not take parameters
[error] val matchs = Vecs.map(term => term.map{ case (a, b) => (map2(a), b) })
Seems it's still an
Thanks Andrew, exactly.
2015-07-22 14:26 GMT-05:00 Andrew Or and...@databricks.com:
Hi Dan,
`map2` is a broadcast variable, not your map. To access the map on the
executors you need to do `map2.value(a)`.
-Andrew
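In other words, a minimal sketch using the names from the original mail:

val map2 = sc.broadcast(map1)
// map2 is a Broadcast[Map[Int, String]]; dereference it with .value on the executors
val matchs = Vecs.map(term => term.map { case (a, b) => (map2.value(a), b) })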
2015-07-22 12:20 GMT-07:00 Dan Dong dongda...@gmail.com:
Hi, Andrew,
Hi Dan,
`map2` is a broadcast variable, not your map. To access the map on the
executors you need to do `map2.value(a)`.
-Andrew
2015-07-22 12:20 GMT-07:00 Dan Dong dongda...@gmail.com:
Hi, Andrew,
If I broadcast the Map:
val map2=sc.broadcast(map1)
I will get compilation error:
That makes a lot of sense, thanks for the concise answer!
On Wed, Jul 22, 2015 at 4:10 PM, Andrew Or and...@databricks.com wrote:
Hi Michael,
In general, driver related properties should not be set through the
SparkConf. This is because by the time the SparkConf is created, we have
already
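As a sketch of the distinction (values and names are illustrative only):

import org.apache.spark.{SparkConf, SparkContext}

// Executor-side settings can still be set programmatically before the context is created:
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.memory", "4g")
// Driver memory, however, must be known before the driver JVM starts, so pass it
// to spark-submit instead, e.g.: spark-submit --driver-memory 4g --class MyApp myapp.jar
val sc = new SparkContext(conf)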
Hi Srikanth,
It does look like a bug. Did you set `spark.executor.cores` in your
application by any chance?
-Andrew
2015-07-22 8:05 GMT-07:00 Srikanth srikanth...@gmail.com:
Hello,
I've set spark.deploy.spreadOut=false in spark-env.sh.
export
Tachyon is one way. Also check out the Spark Job Server
https://github.com/spark-jobserver/spark-jobserver .
I was about to say what the previous post said, so +1 to the previous
post. From my understanding (gut feeling) of your requirement, it is very easy
to do this with spark-job-server.
Hi, I have a DataFrame which I need to convert into a JavaRDD and back to a
DataFrame. I have the following code:
DataFrame sourceFrame = hiveContext.read().format("orc").load("/path/to/orc/file");
// I do an order by on sourceFrame above and then I convert it into a JavaRDD
JavaRDD<Row> modifiedRDD =
The first question I would ask is: have you determined whether you have a
performance issue writing to Oracle? In particular, how many commits are you
making? If you are issuing a lot of commits, that would be a performance problem.
Robin
On 22 Jul 2015, at 19:11, diplomatic Guru
Hi group,
I seem to have encountered a weird problem with 'spark-submit' and manually
setting SparkConf values in my applications.
It seems like setting the configuration values spark.executor.memory
and spark.driver.memory doesn't have any effect when they are set from
within my application
Hi all,
I am using the databricks csv library to load some data into a data frame.
https://github.com/databricks/spark-csv
I am trying to confirm that failfast mode works correctly and aborts
execution upon receiving an invalid CSV file, but I have not been able to
see it fail yet after testing
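For reference, the load looks roughly like this (a sketch; the path is a
placeholder and the option names follow the spark-csv package conventions):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("mode", "FAILFAST")   // expect an exception on a malformed line
  .load("/path/to/input.csv")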
Cool. Thanks!
Srikanth
On Wed, Jul 22, 2015 at 3:12 PM, Andrew Or and...@databricks.com wrote:
Hi Srikanth,
I was able to reproduce the issue by setting `spark.cores.max` to a number
greater than the number of cores on a worker. I've filed SPARK-9260 which I
believe is already being fixed
Hello again,
In trying to understand the caching of intermediate RDDs by ALS, I looked into
the source code and found what may be a bug. Looking here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L230
you see that ALS.train()
Actually, I should clarify - Tachyon is a way to keep your data in RAM, but
it's not exactly the same as keeping it cached in Spark. Spark Job Server
is a way to keep it cached in Spark.
In spark streaming 1.3 -
Say I have 10 executors each with 4 cores, so in total 40 tasks in parallel
at once. If I repartition the Kafka direct stream to 40 partitions vs. say I
have 300 partitions in the Kafka topic - which one will be more efficient?
Should I repartition the Kafka stream equal to num of
Hi Srikanth,
I was able to reproduce the issue by setting `spark.cores.max` to a number
greater than the number of cores on a worker. I've filed SPARK-9260 which I
believe is already being fixed in https://github.com/apache/spark/pull/7274.
Thanks for reporting the issue!
-Andrew
2015-07-22
Hi,
I’m stuck with the same issue, but I see
org.apache.hadoop.fs.s3native.NativeS3FileSystem in hadoop-core:1.0.4
(that’s the current hadoop-client I use), and so far that is a transitive
dependency that comes from Spark itself. I’m using a custom build of Spark
1.3.1 with hadoop-client 1.0.4.
Hi,
It would be whatever's left in the JVM. This is not explicitly controlled
by a fraction like storage or shuffle. However, the computation usually
doesn't need to use that much space. In my experience it's almost always
the caching or the aggregation during shuffles that's the most memory
Hi,
We tested this receiver internally in Stratio Sparkta, and it works fine.
If you try the receiver, we're open to your collaboration; your issues
will be welcome.
Regards
A.Rincón
Stratio software architect
2015-07-22 8:15 GMT+02:00 Tathagata Das t...@databricks.com:
You could
If I understand this correctly, you could join area_code_user and
area_code_state and then flatMap to get
(user, areacode, state). Then groupBy/reduceBy user.
You can also try some join optimizations like partitioning on area code or
broadcasting the smaller table, depending on size of
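A rough sketch of that join (assuming both are pair RDDs keyed by area code;
names and shapes are assumptions):

// area_code_user: RDD[(areaCode, user)], area_code_state: RDD[(areaCode, state)]
val joined = area_code_user.join(area_code_state)          // (areaCode, (user, state))
val byUser = joined.map { case (areaCode, (user, state)) => (user, (areaCode, state)) }
val grouped = byUser.groupByKey()                          // user -> all (areaCode, state) pairs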
Hi,
The setting of 0.2 / 0.6 looks reasonable to me. Since you are not using
caching at all, have you tried something more extreme, like 0.1 /
0.9? Since disabling spark.shuffle.spill didn't cause an OOM, this setting
should be fine. Also, one thing you could do is to verify the shuffle
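i.e. something along these lines (a sketch, assuming the first number is the
storage fraction and the second the shuffle fraction, as in the thread):

val conf = new SparkConf()
  .set("spark.storage.memoryFraction", "0.1")
  .set("spark.shuffle.memoryFraction", "0.9")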
Hi Dan,
If the map is small enough, you can just broadcast it, can't you? It
doesn't have to be an RDD. Here's an example of broadcasting an array and
using it on the executors:
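Something like the following (a minimal sketch; the names and values are
illustrative only):

val lookup = Array("a", "b", "c")
val bcast = sc.broadcast(lookup)
val indices = sc.parallelize(Seq(0, 2, 1))
val resolved = indices.map(i => bcast.value(i))   // access the broadcast value on executors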
That was pseudo code; a working version would look like this:
val stream = TwitterUtils.createStream(ssc, None)
val hashTags = stream.flatMap(status => status.getText.split(" ")
  .filter(_.startsWith("#"))).map(x => (x.toLowerCase, 1))
val topCounts10 = hashTags.map((_,
Hi guys!
I'm new to Mesos. I have two Spark applications (one streaming and one
batch). I want to run both apps on a Mesos cluster. For now, for testing, I
want to run them in a Docker container, so I started a simple
redjack/mesos-master, but a lot of things are still unclear to me (both Mesos
and Spark-on-Mesos).
Yes, you could unroll from the iterator in batches of 100-200 and then post
them in multiple rounds.
If you are using the Kafka receiver based approach (not Direct), then the
raw Kafka data is stored in the executor memory. If you are using Direct
Kafka, then it is read from Kafka directly at the
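For example, a minimal sketch of draining the iterator in batches (post() here
is a hypothetical stand-in for whatever call you make per round):

stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    iter.grouped(200).foreach { batch =>   // unroll ~200 records at a time
      post(batch)                          // hypothetical posting call, one round per batch
    }
  }
}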
For Java, do
OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
If you fix that error, you should be seeing data.
You can call arbitrary RDD operations on a DStream, using
DStream.transform. Take a look at the docs.
For the direct kafka approach you are doing,
- tasks
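A minimal sketch of transform, as mentioned above (stream name and predicate
are placeholders):

val filtered = directKafkaStream.transform { rdd =>
  rdd.filter { case (key, value) => value.nonEmpty }   // any ordinary RDD operation can go here
}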
With DirectKafkaStream there are two approaches.
1. You increase the number of Kafka partitions; Spark will automatically
read in parallel.
2. If that's not possible, then explicitly repartition only if there are
more cores in the cluster than the number of Kafka partitions, AND the
first map-like
You could contact the authors of the spark-package... maybe that will help?
On Mon, Jul 20, 2015 at 6:41 AM, Jeetendra Gangele gangele...@gmail.com
wrote:
Thanks Todd,
I'm not sure whether somebody has used it or not. Can somebody confirm if
this integrates nicely with Spark Streaming?
On
Since Hive doesn’t support schema evolution, you’ll have to update the
schema stored in metastore somehow. For example, you can create a new
external table with the merged schema. Say you have a Hive table t1:
CREATE TABLE t1 (c0 INT, c1 DOUBLE);
By default, this table is stored in HDFS
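A sketch of the kind of statement meant here (the table name, extra column,
and location are assumptions, not from the original mail):

hiveContext.sql("""
  CREATE EXTERNAL TABLE t1_merged (c0 INT, c1 DOUBLE, c2 STRING)
  STORED AS PARQUET
  LOCATION '/user/hive/warehouse/t1'
""")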
Hi
RDD solution:
u = [(615,1),(720,1),(615,2)]
urdd=sc.parallelize(u,1)
a1 = [(615,'T'),(720,'C')]
ardd=sc.parallelize(a1,1)
def addString(s1, s2):
    return s1 + ',' + s2
j = urdd.join(ardd).map(lambda t:t[1]).reduceByKey(addString)
print j.collect()
[(2, 'T'), (1, 'C,T')]
However, if
I have a Spark cluster running in client mode with the driver outside the Spark
cluster. I want to scale the cluster after an application is submitted. In
order to do this, I'm creating new workers and they are getting registered
with the master, but the issue I'm seeing is that the running application does not use
Real-time is, of course, relative, but you’ve mentioned the microsecond level.
Spark is designed to process large amounts of data in a distributed fashion. No
distributed system I know of could give any kind of guarantees at the
microsecond level.
Robin
On 22 Jul 2015, at 11:14, Louis Hust
Hi all!
I have a MatrixFactorizationModel object. If I try to recommend products
to a single user right after constructing the model through ALS.train(...), it
takes 300ms (for my data and hardware). But if I save the model to disk and load
it back, then recommendation takes almost 2000ms. Also Spark
Hi, all
I am using a Spark jar in standalone mode to fetch data from different MySQL
instances and do some actions, but I found the time taken is at the second level.
So I want to know whether a Spark job is suitable for real-time queries at
the microsecond level?
Thank you very much Shivaram. I’ve got it working on Mac now by specifying the
namespace:
using SparkR:::parallelize() instead of just parallelize()
Wkr,
Serge
On 21 Jul 2015, at 17:20, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
There shouldn't be
I did a simple test using Spark in standalone mode (not cluster),
and found a simple action takes a few seconds, even though the data size is
small, just a few rows.
So will each Spark job cost some time for init or prepare work no matter
what the job is?
I mean, will the basic framework of a Spark job cost
HI All,
I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
complex query analysis on this data - queries like AND queries involving
multiple fields.
So my question is: in which format should I store the data in HDFS so
that processing will be fast for such kinds of queries?
As you know, the hadoop versions and so on are available in the spark build
files; iirc the top-level pom.xml has all the maven variables for versions.
So I think if you just build hadoop locally (i.e. build it as, say,
2.2.1234-SNAPSHOT and mvn install it), you should be able to change the
You can use the Spark REST job server (or any other solution that provides a
long-running Spark context) so that you won't pay this bootstrap time on each
query.
In addition: if you have some RDD that you want your queries to be executed
on, you can cache this RDD in memory (depends on your cluster memory
Hello!
I don't understand why, but I can't read data from my parquet file. I made
the parquet file from a json file and read it into a data frame:
df.printSchema()
 |-- param: struct (nullable = true)
 |    |-- FORM: string (nullable = true)
 |    |-- URL: string (nullable = true)
When I try to read
Hi All,
I have a cluster with spark 1.4.
I am trying to save data to mysql but getting error
Exception in thread main java.sql.SQLException: No suitable driver found
for jdbc:mysql://.rds.amazonaws.com:3306/DAE_kmer?user=password=
I looked at https://issues.apache.org/jira/browse/SPARK-8463
Can you provide an example of an AND query? If you do just look-ups you
should try HBase/Phoenix; otherwise you can try ORC with storage index
and/or compression, but this depends on what your queries look like.
On Wed, 22 Jul 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com
wrote:
HI
Thanks Robin for your reply.
I'm pretty sure that writing to Oracle is taking longer, as writing to
HDFS only takes ~5 minutes.
The job is writing about ~5 million records. I've set the job to call
executeBatch() when the batch size reaches 200,000 records, so I assume
that commit
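For context, the write loop is roughly shaped like this (a simplified sketch;
the RDD, table, URL and credentials are placeholders):

results.foreachPartition { rows =>
  val conn = java.sql.DriverManager.getConnection(oracleUrl, dbUser, dbPassword)
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO results_table VALUES (?, ?)")
  var n = 0
  rows.foreach { case (k, v) =>
    stmt.setString(1, k)
    stmt.setLong(2, v)
    stmt.addBatch()
    n += 1
    if (n % 200000 == 0) { stmt.executeBatch(); conn.commit() }  // batch size as described above
  }
  stmt.executeBatch()
  conn.commit()
  conn.close()
}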
Hello,
I'm trying to link records from two large data sources. Both datasets have
almost the same number of rows.
The goal is to match records based on multiple columns.
val matchId =
SFAccountDF.as("SF").join(ELAccountDF.as("EL")).where($"SF.Email" ===
$"EL.EmailAddress" || $"SF.Phone" === $"EL.Phone")
Hi,
Assume the following scenario:
The Spark standalone cluster has 10 cores in total; I have an application that
will request 12 cores. Will the application run with fewer cores than requested,
or will it simply wait forever since there are only 10 cores available?
I would guess it will be run
I'm currently using Spark 1.4 in standalone mode.
I've forked the Apache Hive branch from https://github.com/pwendell/hive
https://github.com/pwendell/hive and customised in the following way.
Added a thread local variable in SessionManager class. And I'm setting the
session variable in my
I do not think you can put all your queries into the row key without
duplicating the data for each query. However, this would be more of a last
resort.
Have you checked out Phoenix for HBase? This might suit your needs. It
makes things much simpler, because it provides SQL on top of HBase.
Nevertheless,
Can anybody help here?
On 22 July 2015 at 10:38, Jeetendra Gangele gangele...@gmail.com wrote:
Hi All,
I am trying to capture user activity for a real estate portal.
I am using a RabbitMQ and Spark Streaming combination, where I push all the
events to RabbitMQ and then a 1-second micro
Queries will be something like:
1. How many users visited a 1 BHK flat in the last hour in a given particular area?
2. How many visitors for flats in a given area?
3. List all users who bought a given property in the last 30 days.
Further, it may get more complex, involving multiple parameters in my query.
The
Does it also support insert operations?
On Jul 22, 2015 4:53 PM, Bing Xiao (Bing) bing.x...@huawei.com wrote:
We are happy to announce the availability of the Spark SQL on HBase
1.0.0 release.
http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this
I'm trying to do some simple counting and aggregation in an IPython notebook
with Spark 1.4.0 and I have encountered behavior that looks like a bug.
When I try to filter rows out of an RDD with a column named "count" I get a
large error message. I would just avoid naming things "count", except
Parquet
Mohammed
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Wednesday, July 22, 2015 5:48 AM
To: user
Subject: Need help in SparkSQL
HI All,
I have data in MongoDb(few TBs) which I want to migrate to HDFS to do complex
queries analysis on this data.Queries like AND queries
Hi,
I have a simple test Spark program, as below. The strange thing is that it
runs well under spark-shell, but gets a runtime error of
java.lang.NoSuchMethodError:
in spark-submit, which indicates the line:
val maps2 = maps.collect.toMap
has a problem. But why does the compilation have no
Yes. Tachyon can handle this well: http://tachyon-project.org/
Best,
Haoyuan
On Wed, Jul 22, 2015 at 10:56 AM, swetha swethakasire...@gmail.com wrote:
Hi,
We have a requirement wherein we need to keep RDDs in memory between Spark
batch processing that happens every one hour. The idea here
Hi all,
I am trying to do WURFL lookups in a Spark cluster and am getting exceptions. I am
pretty sure that the same thing works at small scale, but it fails when I try
to do it in Spark. I used spark-ec2/copy-dir to copy the WURFL library to the
workers already and launched spark-shell with
We are happy to announce the availability of the Spark SQL on HBase 1.0.0
release. http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
The main features in this package, dubbed Astro, include:
* Systematic and powerful handling of data pruning and intelligent
scan, based
try setting --driver-class-path
On Wed, Jul 22, 2015 at 3:45 PM, roni roni.epi...@gmail.com wrote:
Hi All,
I have a cluster with spark 1.4.
I am trying to save data to mysql but getting error
Exception in thread main java.sql.SQLException: No suitable driver found
for
Is it complaining about collect or toMap? In either case this error is
usually indicative of an old version -- any chance you have an old
installation of Spark somehow? Or Scala? You can try running spark-submit
with --verbose. Also, when you say it runs with spark-shell, do you run
spark-shell in
I would be interested in the answer to this question, plus the relationship
between those and registerTempTable()
Pedro
On Tue, Jul 21, 2015 at 1:59 PM, Brandon White bwwintheho...@gmail.com
wrote:
A few questions about caching a table in Spark SQL.
1) Is there any difference between caching
Additionally have you tried enclosing count in `backticks`?
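e.g. a quick sketch (the table and DataFrame names are made up):

sqlContext.sql("SELECT `count` FROM events WHERE `count` > 10")
// or refer to the column through the DataFrame API so it isn't parsed as count()
df.filter(df("count") > 10)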
On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust mich...@databricks.com
wrote:
I believe this will be fixed in Spark 1.5
https://github.com/apache/spark/pull/7237
On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T
I believe this will be fixed in Spark 1.5
https://github.com/apache/spark/pull/7237
On Wed, Jul 22, 2015 at 3:04 PM, Young, Matthew T matthew.t.yo...@intel.com
wrote:
I'm trying to do some simple counting and aggregation in an IPython
notebook with Spark 1.4.0 and I have encountered behavior
I am also having problems with triangle count - it seems this algorithm
is very memory consuming (I could not process even small graphs, ~5 million
vertices and 70 million edges, with less than 32 GB RAM on EACH machine).
What if I have graphs with a billion edges; what amount of RAM do I need then?
Hi Anders,
Did you ever get to the bottom of this issue? I'm encountering it too, but
only in yarn-cluster mode running on spark 1.4.0. I was thinking of
trying 1.4.1 today.
Michael
On Thu, Jun 25, 2015 at 5:52 AM, Anders Arpteg arp...@spotify.com wrote:
Yes, both the driver and the
This page, http://spark.apache.org/docs/latest/running-on-mesos.html,
covers many of these questions. If you submit a job with the option
--supervise, it will be restarted if it fails.
You can use Chronos for scheduling. You can create a single streaming job
with a 10 minute batch interval, if
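e.g. a minimal sketch (conf name is a placeholder):

import org.apache.spark.streaming.{Minutes, StreamingContext}
val ssc = new StreamingContext(sparkConf, Minutes(10))   // one batch every 10 minutes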
While it's not recommended to overwrite files Hive thinks it understands,
you can add the column to Hive's metastore using an ALTER TABLE command
using HiveQL in the Hive shell or using HiveContext.sql():
ALTER TABLE mytable ADD COLUMNS (col_name data_type)
See
My code is like below:
Map<String, String> t11opt = new HashMap<String, String>();
t11opt.put("url", DB_URL);
t11opt.put("dbtable", "t11");
DataFrame t11 = sqlContext.load("jdbc", t11opt);
t11.registerTempTable("t11");
...the same for
I could get a little further:
- installed spark-1.4.1-without-hadoop
- unpacked hadoop 2.7.1
- added the following in spark-env.sh:
HADOOP_HOME=/opt/hadoop-2.7.1/
Are you running the Spark cluster in standalone or YARN?
In standalone, the application gets the available resources when it starts.
With YARN, you can try to turn on the setting
*spark.dynamicAllocation.enabled*
See https://spark.apache.org/docs/latest/configuration.html
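A sketch of the relevant settings (the executor counts are illustrative only):

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")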
On Wed, Jul 22, 2015 at
Are you using the JDBC server?
Paolo
Sent from my Windows Phone
From: Louis Hust louis.h...@gmail.com
Sent: 22/07/2015 13:47
To: Robin East robin.e...@xense.co.uk
Cc: user@spark.apache.org
Subject: Re: Is spark suitable
Hi,
I tried to enable Master metrics source (to get number of running/waiting
applications etc), and connected it to Graphite.
However, when these are enabled, application metrics are also sent.
Is it possible to separate them, and send only master metrics without
applications?
I see that