Hi Spark users,
could someone help me out?
My company has a fully functioning Spark cluster with Shark running on
top of it (as part of the same cluster, on the same LAN). I'm
interested in running raw Spark code against it but am running into
the following issue -- it seems like the machine
Does sbt show full-classpath list spark-core on the classpath? I am
still pretty new to Scala but it seems like you have val sparkCore
= "org.apache.spark" %% "spark-core" % V.spark %
"provided" -- I believe the provided part means it's assumed to already
be on your classpath at runtime. The spark-shell script sets up
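For reference, here's the shape of that dependency in a minimal build.sbt (a sketch only; the name and version numbers are illustrative, not from this thread):

// build.sbt -- spark-core is compiled against but not bundled into the jar,
// since the cluster supplies it at runtime ("provided" scope)
name := "my-spark-app"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0" % "provided"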
I am able to run standalone apps. I think you are making one mistake
that throws you off from there onwards. You don't need to put your app
under SPARK_HOME. I would create it in its own folder somewhere; it
follows the rules of any standalone Scala program (including the
layout). In the guide,
, and
someone will point out that I'm missing a small step that will make the
Spark distribution of sbt work!
Diana
On Mon, Mar 24, 2014 at 4:52 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Diana, I just tried it on a clean Ubuntu machine, with Spark 0.8
(since like other folks I had sbt
Ognen, can you comment if you were actually able to run two jobs
concurrently with just restricting spark.cores.max? I run Shark on the
same cluster and was not able to see a standalone job get in (since
Shark is a long running job) until I restricted both spark.cores.max
_and_
Nan (or anyone who feels they understand the cluster architecture well),
can you clarify something for me?
From reading this user group and your explanation above it appears that the
cluster master is only involved in this during application startup -- to
allocate executors (from what you wrote
Nicholas, I'm in Boston and would be interested in a Spark group. Not
sure if you know this -- there was a meetup that never got off the
ground. Anyway, I'd be +1 for attending. Not sure what is involved in
organizing. Seems a shame that a city like Boston doesn't have one.
On Mon, Mar 31, 2014
I might be wrong here but I don't believe it's discouraged. Maybe part
of the reason there's not a lot of examples is that sql2rdd returns an
RDD (TableRDD that is
https://github.com/amplab/shark/blob/master/src/main/scala/shark/SharkContext.scala).
I haven't done anything too complicated yet but
times, or if it appeared that the launch times
were serial with respect to the durations. If for some reason Spark started
using only one executor, say, each task would take the same duration but
would be executed one after another.
On Tue, Apr 8, 2014 at 8:11 AM, Yana Kadiyska yana.kadiy
I think what you want to do is set spark.driver.port to a fixed port.
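Something along these lines (a sketch; the port number is arbitrary -- pick one your firewall allows):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("fixed-driver-port")
  .set("spark.driver.port", "7777") // executors connect back to the driver on this port
val sc = new SparkContext(conf)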
On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote:
Hi All,
I encountered this problem when the firewall is enabled between the
spark-shell and the Workers.
When I launch spark-shell in yarn-client
Hi, I am running into a pretty concerning issue with Shark (granted I'm
running v. 0.8.1).
I have a Spark slave node that has run out of disk space. When I try to
start Shark it attempts to deploy the application to a directory on that
node, fails and eventually gives up (I see a Master Removed
Does the spark UI show your program running? (http://spark-masterIP:8118).
If the program is listed as running you should be able to see details via
the UI. In my experience there are 3 sets of logs -- the log where you're
running your program (the driver), the log on the master node, and the log
Hi, I see this has been asked before but has not gotten any satisfactory
answer so I'll try again:
(here is the original thread I found:
http://mail-archives.apache.org/mod_mbox/spark-user/201403.mbox/%3c1394044078706-2312.p...@n3.nabble.com%3E
)
I have a set of workers dying and coming back
Like many people, I'm trying to do hourly counts. The twist is that I don't
want to count per hour of streaming, but per hour of the actual occurrence
of the event (wall clock, say yyyy-MM-dd HH).
My thought is to make the streaming window large enough that a full hour of
streaming data would fit
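Roughly what I have in mind (a sketch only -- events and eventTime are placeholders for my actual stream and timestamp extraction; the window and slide durations are illustrative):

import org.apache.spark.streaming.{Minutes, Seconds}
import java.text.SimpleDateFormat
import java.util.Date

val fmt = new SimpleDateFormat("yyyy-MM-dd HH")
val hourlyCounts = events
  .map { e => (fmt.format(new Date(eventTime(e))), 1L) }  // key by event-time hour
  .reduceByKeyAndWindow(_ + _, Minutes(70), Seconds(10))  // window wider than an hour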
Hi folks, hoping someone can explain to me what's going on:
I have the following code, largely based on RecoverableNetworkWordCount
example (
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala
):
I am setting
class Foo extends Serializable {
...
}
val foo = new Foo()
foo.field1 = "blah"
lines.map(line => { println(foo) }) // now you should see the field
values you set.
On Mon, Jun 23, 2014 at 7:44 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks, hoping someone can explain to me
Hi community, this one should be an easy one:
I have left spark.task.maxFailures at its default (which should be
4). I see a job that shows the following statistics for Tasks:
Succeeded/Total
7109/819 (1 failed)
So there were 819 tasks to start with. I have 2 executors in that
cluster. From
Are you looking at the driver log? (e.g. Shark?). I see a ton of
information in the INFO category on what query is being started, what
stage is starting and which executor stuff is sent to. So I'm not sure
if you're saying you see all that and you need more, or that you're
not seeing this type of
Are you saying that both streams come in at the same rate and you have
the same batch interval but the batch size ends up different? i.e. two
datapoints both arriving at X seconds after streaming starts end up in
two different batches? How do you define real time values for both
streams? I am
A lot of things can get funny when you run distributed as opposed to
local -- e.g. some jar not making it over. Do you see anything of
interest in the log on the executor machines -- I'm guessing
192.168.222.152/192.168.222.164. From here
as
it gets but still going on 32+ hours.
On Wed, Jul 2, 2014 at 2:40 AM, Mayur Rustagi mayur.rust...@gmail.com wrote:
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
On Mon, Jun 30, 2014 at 8:09 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi,
our
The scripts that Xiangrui mentions set up the classpath... Can you run
./run-example for the provided example successfully?
What you can try is set SPARK_PRINT_LAUNCH_COMMAND=1 and then call
run-example -- that will show you the exact java command used to run
the example at the start of execution.
class java.io.IOException: Cannot run program
/Users/aris.vlasakakis/Documents/spark-1.0.0/bin/compute-classpath.sh
(in directory .): error=2, No such file or directory
By any chance, are your SPARK_HOME directories different on the
machine where you're submitting from and the cluster? I'm on an
Does this line println("Retuning " + string) from the hash function
print what you expect? If you're not seeing that output in the
executor log I'd also put some debug statements in the case other
branch, since your match in the interesting case is conditioned on
if (fieldsList.contains(index)) -- maybe that
I do not believe the order of points in a distributed RDD is in any
way guaranteed. For a simple test, you can always add a last column
which is an id (make it double and throw it in the feature vector).
Printing the rdd back will not give you the points in file order. If
you don't want to go that
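The id-column idea, as a quick sketch (assuming points is an RDD of Array[Double] feature rows):

val withIds = points.zipWithIndex.map { case (features, id) =>
  features :+ id.toDouble  // append a stable id as the last column
}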
Hi,
I'm starting spark-shell like this:
SPARK_MEM=1g SPARK_JAVA_OPTS="-Dspark.cleaner.ttl=3600"
/spark/bin/spark-shell -c 3
but when I try to create a streaming context
val scc = new StreamingContext(sc, Seconds(10))
I get:
org.apache.spark.SparkException: Spark Streaming cannot be used
Hi folks, hoping someone who works with Streaming can help me out.
I have the following snippet:
val stateDstream = data.map(x => (x, 1))
  .updateStateByKey[State](updateFunc)
stateDstream.saveAsTextFiles(checkpointDirectory, "partitions_test")
where data is a RDD of
case class
Hi all, trying to change defaults of where stuff gets written.
I've set -Dspark.local.dir=/spark/tmp and I can see that the setting is
used when the executor is started.
I do indeed see directories like spark-local-20140815004454-bb3f in this
desired location but I also see undesired stuff under
Sean, would this work --
rdd.mapPartitions { partition => Iterator(partition) }.foreach(
// Some setup code here
// save partition to DB
// Some cleanup code here
)
I tried a pretty simple example ... I can see that the setup and
cleanup are executed on the executor node, once per
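A sketch of the same setup/save/cleanup pattern with foreachPartition (my guess at the shape, not the exact code from the thread; DbConnection is a stand-in for whatever client you use):

rdd.foreachPartition { partition =>
  val conn = DbConnection.open()  // setup, runs on the executor, once per partition
  try {
    partition.foreach(record => conn.insert(record))  // save partition to DB
  } finally {
    conn.close()                  // cleanup, once per partition
  }
}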
Hi, I tried a similar question before and didn't get any answers,so I'll
try again:
I am using updateStateByKey, pretty much exactly as shown in the examples
shipping with Spark:
def createContext(master: String, dropDir: String, checkpointDirectory: String) = {
val updateFunc = (values:
Can you clarify the scenario:
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDirectory)
val stream = KafkaUtils.createStream(...)
val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L))
val wordDstream = wordCounts.updateStateByKey[Int](updateFunc)
I understand that the DB writes are happening from the workers unless you
collect. My confusion is that you believe workers recompute on recovery
(node computations which get redone upon recovery). My understanding is
that checkpointing dumps the RDD to disk and then cuts the RDD lineage. So I
JavaSparkContext java_SC = new JavaSparkContext(conf); is the spark
context. An application has a single spark context -- you won't be able to
keep calling this -- you'll see an error if you try to create a second
such object from the same application.
Additionally, depending on your
It seems like the next release will add a nice org.apache.spark.mllib.feature
package but what is the recommended way to normalize features in the
current release (1.0.2) -- I'm hoping for a general pointer here.
At the moment I have an RDD[LabeledPoint] and I can get a
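To make the question concrete, a manual zero-mean/unit-variance sketch of the kind of thing I mean (not an official API; assumes dense features and data: RDD[LabeledPoint]):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val features = data.map(_.features.toArray)
val n = features.count().toDouble
// per-column mean
val sums = features.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
val means = sums.map(_ / n)
// per-column standard deviation
val sqDiffs = features
  .map(f => f.zip(means).map { case (x, m) => (x - m) * (x - m) })
  .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
val stds = sqDiffs.map(s => math.sqrt(s / n))

val normalized = data.map { p =>
  val scaled = p.features.toArray.zip(means.zip(stds)).map {
    case (x, (m, s)) => if (s == 0) 0.0 else (x - m) / s
  }
  LabeledPoint(p.label, Vectors.dense(scaled))
}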
In the third case the object does not get shipped around. Each executor
will create its own instance. I got bitten by this here:
http://apache-spark-user-list.1001560.n3.nabble.com/Help-with-object-access-from-mapper-simple-question-tt8125.html
On Thu, Sep 4, 2014 at 9:29 AM, Andrianasolo
These are just warnings from the web server. Normally your application will
have a UI page on port 4040. In your case, a little after the warning it
should bind just fine to another port (mine picked 4041). I'm running on
0.9.1. Do you actually see the application failing? The main thing when
Which version are you using -- I can reproduce your issue w/ 0.9.2 but not
with 1.0.1...so my guess is that it's a bug and the fix hasn't been
backported... No idea on a workaround though..
On Fri, Sep 5, 2014 at 7:58 AM, Dhimant dhimant84.jays...@gmail.com wrote:
Hi,
I am getting type
spark-submit is a script which calls spark-class script. Can you output the
command that spark-class runs (say, by putting set -x before the very last
line?). You should see the java command that is being run. The scripts do
some parameter setting so it's possible you're missing something. It
If that library has native dependencies you'd need to make sure that the
native code is on all boxes and in the path with export
SPARK_LIBRARY_PATH=...
On Tue, Sep 9, 2014 at 10:17 AM, ayandas84 ayanda...@gmail.com wrote:
We have a small apache spark cluster of 6 computers. We are trying to
Tim, I asked a similar question twice:
here
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Cannot-get-executors-to-stay-alive-tt12940.html
and here
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-Executor-OOM-tt12383.html
and have not yet received any responses. I
Hi ladies and gents,
trying to get Thrift server up and running in an effort to replace Shark.
My first attempt to run sbin/start-thriftserver resulted in:
14/09/15 17:09:05 ERROR TThreadPoolServer: Error occurred during processing
of message.
java.lang.RuntimeException:
If your dashboard is doing ajax/pull requests against say a REST API you
can always create a Spark context in your rest service and use SparkSQL to
query over the parquet files. The parquet files are already on disk so it
seems silly to write both to parquet and to a DB...unless I'm missing
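The shape I have in mind, as a sketch (Spark 1.x API; the path and table name are made up):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // sc created once, when the REST service starts
val events = sqlContext.parquetFile("/data/events.parquet")
events.registerTempTable("events")
val counts = sqlContext.sql("select count(*) from events").collect()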
I don't think persist is meant for end-user usage. You might want to call
saveAsTextFiles, for example, if you're saving to the file system as
strings. You can also dump the DStream to a DB -- there are samples on this
list (you'd have to do a combo of foreachRDD and mapPartitions, likely)
On
,
So, user quotas need another data store, which can guarantee persistence
and afford frequent data updates/access. Is that correct?
Thanks,
Chia-Chun
2014-10-01 21:48 GMT+08:00 Yana Kadiyska yana.kadiy...@gmail.com:
I don't think persist is meant for end-user usage. You might want to call
Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC
server that comes with Spark 1.1.0.
However I observed that conditional functions do not work (I tried 'case'
and 'coalesce'); some string functions like 'concat' also did not work.
Is there a list of what's missing or a
, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi, in an effort to migrate off of Shark I recently tried the Thrift JDBC
server that comes with Spark 1.1.0.
However I observed that conditional functions do not work (I tried 'case'
and 'coalesce')
some string functions like 'concat' also did
when you're running spark-shell and the example, are you actually
specifying --master spark://master:7077 as shown here:
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
because if you're not, your spark-shell is running in local mode and not
actually connecting to
clue.
On 03.10.14 19:37, Yana Kadiyska wrote:
when you're running spark-shell and the example, are you actually
specifying --master spark://master:7077 as shown here:
http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
because if you're not, your spark-shell
in! These both look like they should have JIRAs.
On Fri, Oct 3, 2014 at 8:14 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Thanks -- it does appear that I misdiagnosed a bit: case works generally,
but the bit operation does not seem to work (type of bit_field
If you just want the ratio of positive to all values per key (if I'm
reading right) this works
val reduced = input.groupByKey().map(grp =>
  grp._2.filter(v => v > 0).size.toFloat / grp._2.size)
reduced.foreach(println)
reduced.foreach(println)
I don't think you need reduceByKey or combineByKey as you're not doing
anything where the
From this line: "Removing executor app-20141015142644-0125/0 because it is
EXITED" I would guess that you need to examine the executor log to see why
the executor actually exited. My guess would be that the executor cannot
connect back to your driver. But check the log from the executor. It should
You can put a try/catch block in the map function and log the exception.
The only tricky part is that the exception log will be located in the
executor machine. Even if you don't do any trapping you should see the
exception stacktrace in the executors' stderr log which is visible through
the UI
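A minimal sketch of the trap-and-log idea (the parsing body is a stand-in for whatever your map actually does):

val parsed = rdd.map { line =>
  try {
    Some(line.toInt)  // stand-in for the real work
  } catch {
    case e: Exception =>
      System.err.println(s"failed on record: $line -- $e")  // lands in the executor's stderr
      None
  }
}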
This works for me
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val v1=Vectors.dense(Array(1d,2d))
val v2=Vectors.dense(Array(3d,4d))
val rows=sc.parallelize(List(v1,v2))
val mat=new RowMatrix(rows)
val svd:
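The SVD call itself would go something like this (a sketch using RowMatrix.computeSVD; k = 2 is illustrative since the matrix has two columns):

val svd: SingularValueDecomposition[RowMatrix, Matrix] =
  mat.computeSVD(2, computeU = true)
val U: RowMatrix = svd.U  // left singular vectors, distributed
val s: Vector = svd.s     // singular values
val V: Matrix = svd.V     // right singular vectors, local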
Hi folks,
I'm trying to deploy the latest from master branch and having some trouble
with the assembly jar.
In the spark-1.1 official distribution(I use cdh version), I see the
following jars, where spark-assembly-1.1.0-hadoop2.0.0-mr1-cdh4.2.0.jar
contains a ton of stuff:
related but I don't
know.
On Fri, Oct 24, 2014 at 2:22 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks,
I'm trying to deploy the latest from master branch and having some
trouble
with the assembly jar.
In the spark-1.1 official distribution(I use cdh version), I see
Hi all,
I'm trying to set a pool for a JDBC session. I'm connecting to the thrift
server via JDBC client.
My installation appears to be good (queries run fine), I can see the pools
in the UI, but any attempt to set a variable (I tried
spark.sql.shuffle.partitions and
guoxu1231, I struggled with the IDEA problem for a full week. Same thing --
clean builds under Maven/sbt, but no luck with IDEA. What worked for me was
the solution posted higher up in this thread -- it's an SO post that
basically says to delete all iml files anywhere under the project
directory.
I see this when I start a worker and then try to start it again forgetting
it's already running (I don't use start-slaves, I start the slaves
individually with start-slave.sh). All this is telling you is that there is
already a running process on that machine. You can see it if you do a ps
CANNOT FIND ADDRESS occurs when your executor has crashed. I'd look
further down where it shows each task and see if any tasks failed.
Then you can examine the error log of that executor and see why it died.
On Wed, Oct 29, 2014 at 9:35 AM, akhandeshi ami.khande...@gmail.com wrote:
:
hi Yana,
in my case I did not start any spark worker. However, shark was definitely
running. Do you think that might be a problem?
I will take a look
Thank you,
--
*From:* Yana Kadiyska [yana.kadiy...@gmail.com]
*Sent:* Wednesday, October 29, 2014 9:45 AM
Adrian, do you know if this is documented somewhere? I was also under the
impression that setting a key's value to None would cause the key to be
discarded (without any explicit filtering on the user's part) but cannot
find any official documentation to that effect.
On Wed, Nov 12, 2014 at 2:43
Hi all, I am running HiveThriftServer2 and noticed that the process stays
up even though there is no driver connected to the Spark master.
I started the server via sbin/start-thriftserver and my namenodes are
currently not operational. I can see from the log that there was an error
on startup:
https://issues.apache.org/jira/browse/SPARK-4497
On Wed, Nov 19, 2014 at 1:48 PM, Michael Armbrust mich...@databricks.com
wrote:
This is not by design. Can you please file a JIRA?
On Wed, Nov 19, 2014 at 9:19 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi all, I am running
Hi sparkers,
I'm trying to use spark.sql.thriftserver.scheduler.pool for the first time
(earlier I was stuck because of
https://issues.apache.org/jira/browse/SPARK-4037)
I have two pools setup:
[image: Inline image 1]
and would like to issue a query against the low priority pool.
I am doing
Hi folks,
I'm wondering if someone has successfully used wildcards with a parquetFile
call?
I saw this thread and it makes me think no?
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201406.mbox/%3CCACA1tWLjcF-NtXj=pqpqm3xk4aj0jitxjhmdqbojj_ojybo...@mail.gmail.com%3E
I have a set
does spark-submit with SparkPi and spark-examples.jar work?
e.g.
./spark/bin/spark-submit --class org.apache.spark.examples.SparkPi
--master spark://xx.xx.xx.xx:7077 /path/to/examples.jar
On Tue, Dec 9, 2014 at 6:58 PM, Eric Tanner eric.tan...@justenough.com
wrote:
I have set up a cluster
Hi folks, wondering if anyone has thoughts. Trying to create something akin
to a materialized view (sqlContext is a HiveContext connected to external
metastore):
val last2HourRdd = sqlContext.sql(s"select * from mytable")
//last2HourRdd.first prints out an org.apache.spark.sql.Row = [...] with
in the parquet file? One way to test would be to just use
sqlContext.parquetFile(...) which infers the schema from the file instead
of using the metastore.
On Wed, Dec 10, 2014 at 12:46 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks, wondering if anyone has thoughts. Trying to create
Since you mentioned this, I had a related quandary recently -- it also says
that the forum archives u...@spark.incubator.apache.org and
d...@spark.incubator.apache.org respectively, yet the Community page
clearly says to email the
Denny, I am not sure what exception you're observing but I've had luck with
2 things:
val table = sc.textFile("hdfs://")
You can try calling table.first here and you'll see the first line of the
file.
You can also do val debug = table.first.split("\t") which would give you an
array and you can
if you're running the test via sbt you can examine the classpath that sbt
uses for the test (show runtime:full-classpath or last run)-- I find this
helps once too many includes and excludes interact.
On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu amoc...@verticalscope.com
wrote:
I use spark
It looks to me like your executor actually crashed and didn't just finish
properly.
Can you check the executor log?
It is available in the UI, or on the worker machine, under $SPARK_HOME/work/
app-20150123155114-/6/stderr (unless you manually changed the work
directory location but in that
...@gmail.com wrote:
Do you mind filing a JIRA issue for this which includes the actual error
message string that you saw? https://issues.apache.org/jira/browse/SPARK
On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
I am not sure if you get the same exception as I
If you're talking about filter pushdowns for parquet files this also has to
be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's
off by default.
On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wangxy...@gmail.com wrote:
Yes it works!
But the filter can't pushdown!!!
If
Just wondering if you've made any progress on this -- I'm having the same
issue. My attempts to help myself are documented here
http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAJ4HpHFVKvdNgKes41DvuFY=+f_nTJ2_RT41+tadhNZx=bc...@mail.gmail.com%3E
.
I don't believe I have the
scala> sql("SELECT user_id FROM actions where
conversion_aciton_id=20141210")
From: Yana Kadiyska [mailto:yana.kadiy...@gmail.com]
Sent: Wednesday, January 14, 2015 11:12 PM
To: Pala M Muthaia
Cc: user@spark.apache.org
Subject: Re: Issues with constants in Spark HiveQL queries
Just a guess but what is the type of conversion_aciton_id? I do queries
over an epoch all the time with no issues (where epoch's type is bigint).
You can see the source here
https://github.com/apache/spark/blob/v1.2.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
--
not sure what
If the wildcard path you have doesn't work you should probably open a bug
-- I had a similar problem with Parquet and it was a bug which recently got
closed. Not sure if sqlContext.avroFile shares a codepath with
.parquetFile...you
can try running with bits that have the fix for .parquetFile or
I am not sure if you get the same exception as I do -- spark-shell2.cmd
works fine for me. Windows 7 as well. I've never bothered looking to fix it
as it seems spark-shell just calls spark-shell2 anyway...
On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko protsenk...@gmail.com
wrote:
I have a
-site.xml, and then
re-run the query. I can see significant differences by doing so.
I'll open a JIRA and deliver a fix for this ASAP. Thanks again for
reporting all the details!
Cheng
On 1/13/15 12:56 PM, Yana Kadiyska wrote:
Attempting to bump this up in case someone can help out after
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're
missing --master. In addition, it's a good idea to pass --master exactly
the Spark master's endpoint as shown on your UI under
http://localhost:8080. But that should do it. If that's not working, you
can look at the
I am running the following (connecting to an external Hive Metastore)
/a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf
spark.sql.parquet.filterPushdown=true
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
and then ran two queries:
sqlContext.sql(select count(*)
Are you calling saveAsTextFiles on the DStream -- looks like it? Look
at the section called "Design Patterns for using foreachRDD" in the link
you sent -- you want to do dstream.foreachRDD(rdd => rdd.saveAs...)
On Thu, Jan 8, 2015 at 5:20 PM, Su She suhsheka...@gmail.com wrote:
Hello
Hi folks, puzzled by something pretty simple:
I have a standalone cluster with default parallelism of 2, spark-shell
running with 2 cores
sc.textFile("README.md").partitions.size returns 2 (this makes sense)
sc.textFile("README.md").coalesce(100, true).partitions.size returns 100,
also makes sense
On Wednesday, January 7, 2015 5:25 PM, Laeeq Ahmed
laeeqsp...@yahoo.com.INVALID wrote:
Hi Yana,
I also think that
val top10 = your_stream.mapPartitions(rdd => rdd.take(10))
will give top 10 from each partition. I will try your code.
Regards,
Laeeq
On Wednesday, January 7, 2015 5:19 PM, Yana
You might also want to see if Task Scheduler helps with that. I have not
used it with Windows 2008 R2 but it generally does allow you to schedule a
bat file to run on startup
On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV)
ningjun.w...@lexisnexis.com wrote:
Thanks for the suggestion.
:
Hi yana,
I have removed hive-site.xml from the spark/conf directory but am still
getting the same errors. Any other way to work around this?
Regards,
Sandeep
On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
I think you're mixing two things: the docs say When
I think you're mixing two things: the docs say "When not configured by
the hive-site.xml, the context automatically creates metastore_db and
warehouse in the current directory." AFAIK if you want a local metastore,
you don't put hive-site.xml anywhere. You only need the file if you're
going to
the program after setting the property
spark.scheduler.mode to FAIR. But the result is the same as before. Are
there any other properties that have to be set?
On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
It's hard to tell. I have not run this on EC2 but this worked
Hi all,
I had a streaming application and midway through things decided to up the
executor memory. I spent a long time launching like this:
~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest
--executor-memory 2G --master...
and observing that the executor memory is still at the old 512
config took effect. Maybe. :)
TD
On Sat, Feb 21, 2015 at 7:30 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi all,
I had a streaming application and midway through things decided to up the
executor memory. I spent a long time launching like this:
~/spark-1.2.0-bin-cdh4/bin/spark-submit
Shark used to have a shark.map.tasks variable. Is there an equivalent for
Spark SQL?
We are trying a scenario with heavily partitioned Hive tables. We end up
with a UnionRDD with a lot of partitions underneath and hence too many
tasks:
It's hard to tell. I have not run this on EC2 but this worked for me:
The only thing that I can think of is that the scheduling mode is set to
Scheduling Mode: FAIR
val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
// while loop to get curr_job
pool.execute(new
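Fleshed out a bit, the pattern looks like this (a sketch -- hasMoreJobs/nextJob are placeholders, and each job is assumed to be a function that runs an action on the shared SparkContext):

import java.util.concurrent.{ExecutorService, Executors}

val pool: ExecutorService = Executors.newFixedThreadPool(poolSize)
while (hasMoreJobs) {
  val currJob = nextJob()  // currJob: SparkContext => Unit, a placeholder
  pool.execute(new Runnable {
    override def run(): Unit = currJob(sc)  // each thread submits its own job
  })
}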
Can someone confirm if they can run UDFs in group by in spark1.2?
I have two builds running -- one from a custom build from early December
(commit 4259ca8dd12) which works fine, and Spark1.2-RC2.
On the latter I get:
jdbc:hive2://XXX.208:10001> select
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory. But we don't get
to specify _how_ the cores are spread among executors.
Is it possible that
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the
shuffle setting: spark.sql.shuffle.partitions
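For example, to dial it down (a sketch; 16 is arbitrary -- the default is 200):

sqlContext.setConf("spark.sql.shuffle.partitions", "16")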
On Thu, Feb 26, 2015 at 5:51 PM, java8964 java8...@hotmail.com wrote:
Imran, thanks for your explaining about the parallelism. That is very
helpful.
In my test case, I am
.
On Fri, Jan 9, 2015 at 2:56 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
I am running the following (connecting to an external Hive Metastore)
/a/shark/spark/bin/spark-shell --master spark://ip:7077 --conf
spark.sql.parquet.filterPushdown=true
val sqlContext = new
Hi folks, having some seemingly noob issues with the dataframe API.
I have a DF which came from the csv package.
1. What would be an easy way to cast a column to a given type -- my DF
columns are all typed as strings coming from a csv. I see a schema getter
but not setter on DF
2. I am trying
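For question 1, the kind of thing I'm hoping exists, sketched against made-up column names (using cast/as from the 1.3 Column API):

val typed = df.select(df("age").cast("int").as("age"))  // cast the string column, keep the name; add the other columns to the select as needed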
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server
so far and using JDBC prepared statements to escape potentially malicious
user input.
I am trying to port our code directly to HiveContext now (i.e. eliminate
the use of Thrift Server) and I am not quite sure how to
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928
Looks like for now you'd have to list the full paths...I don't see a
comment from an official spark committer so still not sure if this is a bug
or design, but it seems to be the current state of affairs.
On Thu, May 7, 2015 at