There shouldn't be anything Mac OS specific about this feature. One point
of warning though -- As mentioned previously in this thread the APIs were
made private because we aren't sure we will be supporting them in the
future. If you are using these APIs it would be good to chime in on the
JIRA
Hi Andrew Or,
Yes, NodeManager was restarted, I also checked the logs to see if the JARs
appear in the CLASSPATH.
I have also downloaded the binary distribution and use the JAR
spark-1.4.1-bin-hadoop2.4/lib/spark-1.4.1-yarn-shuffle.jar without success.
Has anyone successfully enabled the
ZhuGe,
If you run your program in the cluster deploy-mode you get resiliency against
driver failure, though there are some steps you have to take in how you write
your streaming job to allow for transparent resume. Netflix did a nice writeup
of this resiliency
TD's Spark Summit talk offers suggestions (
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/).
He recommends using HDFS, because you get the triplicate resiliency it
offers, albeit with extra overhead. I believe the driver doesn't need
visibility
I have never used Mahout, so I cannot compare the two. Spark MLlib, however,
provides matrix-factorization-based collaborative filtering
(http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html)
using the Alternating Least Squares (ALS) algorithm.
Also, Singular Value Decomposition is handled
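A minimal, untested ALS sketch from that guide, in case it helps; the
ratings file name and its user,product,rating format are just placeholders:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Hypothetical input: one "user,product,rating" triple per line
val ratings = sc.textFile("ratings.csv").map { line =>
  val Array(user, product, rating) = line.split(",")
  Rating(user.toInt, product.toInt, rating.toDouble)
}

// Factorize with rank 10, 10 iterations, regularization 0.01
val model = ALS.train(ratings, 10, 10, 0.01)

// Predicted rating of product 42 by user 1
val prediction = model.predict(1, 42)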
You can certainly create threads in a map transformation. We do this to do
concurrent DB lookups during one stage for example. I would recommend,
however, that you switch to mapPartitions from map as this allows you to
create a fixed size thread pool to share across items on a partition as
opposed
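A rough, untested sketch of that pattern (dbLookup and the pool size are
placeholders, and rdd is assumed to be an RDD[String]):

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// Placeholder for the real external DB lookup
def dbLookup(key: String): String = key.reverse

val looked = rdd.mapPartitions { iter =>
  // One fixed-size pool per partition, shared by all items in that partition
  val pool = Executors.newFixedThreadPool(8)
  implicit val ec = ExecutionContext.fromExecutorService(pool)
  val futures = iter.map(key => Future(dbLookup(key))).toList
  val results = futures.map(f => Await.result(f, 30.seconds))
  pool.shutdown()
  results.iterator
}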
Thank you for your reply. I will consider hdfs for the checkpoint storage.
On Tue, Jul 21, 2015 at 5:51 PM, Dean Wampler deanwamp...@gmail.com
wrote:
TD's Spark Summit talk offers suggestions (
Hi, I have a simple High level Kafka Consumer like :
package matchinguu.kafka.consumer;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import
I am using Accumulators in a JavaRDD<LabeledPoint>.flatMap() method. I have
logged the localValue() of the Accumulator object and their values are
non-zero. In the driver, after the .flatMap() method returns, calling
value() on the Accumulator yields 0. I am running 1.4.0 in yarn-client
Hi I could successfully install SparkR package into my RStudio but I could
not execute anything against sc or sqlContext. I did the following:
Sys.setenv(SPARK_HOME = "/path/to/spark-1.4.1")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
Above code installs
Have you called collect() / count() on the RDD following flatMap() ?
Cheers
On Tue, Jul 21, 2015 at 8:47 AM, dlmar...@comcast.net wrote:
I am using Accumulators in a JavaRDD<LabeledPoint>.flatMap() method. I
have logged the localValue() of the Accumulator object and their values are
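The underlying point: transformations are lazy, so the accumulator only
fills in after an action runs. A small untested sketch in Scala (rdd stands
in for your RDD):

val acc = sc.accumulator(0)
val mapped = rdd.flatMap { x => acc += 1; Seq(x) }
println(acc.value)   // still 0: nothing has executed yet
mapped.count()       // the action triggers the flatMap
println(acc.value)   // now reflects the number of input records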
Hi Tal,
I'm not sure there is currently a built-in function for it, but you can
easily define a UDF (user defined function) by extending
org.apache.spark.sql.api.java.UDF1, registering it
(sqlContext.udf().register(...)), and then using it inside your query.
RK.
On Tue, Jul 21, 2015 at 7:04
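For example, a rough sketch in Scala (the column and table names are made
up; the Java UDF1 route is analogous):

import java.sql.Timestamp
import java.util.concurrent.TimeUnit

// Whole days between a timestamp column and "now"
sqlContext.udf.register("days_since", (ts: Timestamp) =>
  TimeUnit.MILLISECONDS.toDays(System.currentTimeMillis() - ts.getTime))

val date_diff_df = sqlContext.sql("SELECT days_since(eventTime) AS age_days FROM events")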
FWIW I've run into similar BLAS related problems before and wrote up a
document on how to do this for Spark EC2 clusters at
https://github.com/amplab/ml-matrix/blob/master/EC2.md -- Note that this
works with a vanilla Spark build (you only need to link to netlib-lgpl in
your App) but requires the
Hi,
I'm running a query with sql context where one of the fields is of type
java.sql.Timestamp. I'd like a function similar to DATEDIFF in
MySQL, between the date given in each row and now. So if I were able to use
the same syntax as in MySQL it would be:
val date_diff_df =
The thread dump is below; it seems to hang on accessing the MySQL metastore.
I googled and found a bug related to com.mysql.jdbc.util.ReadAheadInputStream,
but I don't have a workaround.
I am not sure about that. Please help me, thanks.
thread dump---
MyAppDefaultScheduler_Worker-2 prio=10
Hi,
I have modified some Hadoop code and want to build Spark against the modified
version of Hadoop. Do I need to change the compilation dependency files?
If so, how? Many thanks!
Quick example problem that's stumping me:
* Users have 1 or more phone numbers and therefore one or more area codes.
* There are 100M users.
* States have one or more area codes.
* I would like to find the states for the users (as indicated by phone area
code).
I was thinking about something like
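Roughly, what I was thinking (untested sketch with made-up sample data):
turn each phone number into an (areaCode, userId) pair and join against an
(areaCode, state) table.

import org.apache.spark.rdd.RDD

val userByAreaCode: RDD[(String, String)] =
  sc.parallelize(Seq(("212", "u1"), ("415", "u2")))      // (areaCode, userId)
val areaCodeToState: RDD[(String, String)] =
  sc.parallelize(Seq(("212", "NY"), ("415", "CA")))      // (areaCode, state)

val statesPerUser = userByAreaCode
  .join(areaCodeToState)                                 // (areaCode, (userId, state))
  .map { case (_, (userId, state)) => (userId, state) }
  .distinct()
  .groupByKey()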
Yeah, I'm referring to that api.
If you want to filter messages in addition to catching that exception, have
your messageHandler return an Option, so the type R would end up being
Option[WhateverYourClassIs], then filter out the Nones before doing the rest of
your processing.
If you aren't already
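A rough, untested sketch of that idea (ssc, kafkaParams, fromOffsets and the
parse function are placeholders):

import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder parser that may reject malformed records
def parse(s: String): Option[String] = if (s.nonEmpty) Some(s) else None

val stream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder, Option[String]](
  ssc, kafkaParams, fromOffsets,
  (mmd: MessageAndMetadata[String, String]) => parse(mmd.message()))

// Drop the Nones before the rest of the processing
val cleaned = stream.flatMap(_.toSeq)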
Hi thanks for the reply. I did download from github build it and it is
working fine I can use spark-submit etc when I use it in RStudio I dont know
why it is saying sqlContext not found
When I do the following
sqlContext sparkRSQL.init(sc)
Error: object sqlContext not found
if I do the
Hi,
I'm working on a Spark Streaming application and I would like to know what
is the best storage to use
for checkpointing.
For testing purposes we're using NFS between the worker, the master and
the driver program (in client mode),
but we have some issues with the CheckpointWriter (1
I'm trying to define a class that contains as attributes some of Spark's
objects, and am running into a problem that I think would be solved if I
could find Python's equivalent of Scala's "extends Serializable".
Here's a simple class that has a Spark RDD as one of its attributes.
class Foo:
def
Thanks All!
thanks Ayan!
I did the repartition to 20 so it used all the cores in the cluster and it
was done in 3 minutes. It seems the data was skewed to this partition.
On Tue, Jul 14, 2015 at 8:05 PM, ayan guha guha.a...@gmail.com wrote:
Hi
As you can see, Spark has taken data locality into
Hi Andrew,
Based on your driver logs, it seems the issue is that the shuffle service
is actually not running on the NodeManagers, but your application is trying
to provide a spark_shuffle secret anyway. One way to verify whether the
shuffle service is actually started is to look at the
No, I have not. I will try that though, thank you.
- Original Message -
From: Ted Yu yuzhih...@gmail.com
To: dlmarion dlmar...@comcast.net
Cc: user user@spark.apache.org
Sent: Tuesday, July 21, 2015 12:15:13 PM
Subject: Re: Accumulator value 0 in driver
Have you called collect()
That did it, thanks.
- Original Message -
From: dlmar...@comcast.net
To: Ted Yu yuzhih...@gmail.com
Cc: user user@spark.apache.org
Sent: Tuesday, July 21, 2015 1:15:37 PM
Subject: Re: Accumulator value 0 in driver
No, I have not. I will try that though, thank you.
-
Hi,
Could you please decrease your step size to 0.1, and also try 0.01? You
could also try running L-BFGS, which doesn't have step size tuning, to get
better results.
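For example, something along these lines (untested; `training` stands in for
your RDD[LabeledPoint], ideally with scaled features):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 100
Seq(1.0, 0.1, 0.01).foreach { stepSize =>
  val model = LinearRegressionWithSGD.train(training, numIterations, stepSize)
  println(s"stepSize=$stepSize weights=${model.weights} intercept=${model.intercept}")
}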
Best,
Burak
On Tue, Jul 21, 2015 at 2:59 AM, Naveen nav...@formcept.com wrote:
Hi ,
I am trying to use
*java:*
java version 1.7.0_60
Java(TM) SE Runtime Environment (build 1.7.0_60-b19)
Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed mode)
*scala*
Scala code runner version 2.10.5 -- Copyright 2002-2013, LAMP/EPFL
*hadoop cluster:*
with 51 servers running hadoop-2.3-cdh-5.1.0, and
Will work. Thanks!
zipWithUniqueId() doesn't guarantee continuous ID either.
Srikanth
On Tue, Jul 21, 2015 at 9:48 PM, Burak Yavuz brk...@gmail.com wrote:
Would monotonicallyIncreasingId
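If strictly consecutive ids are needed, one untested workaround is
zipWithIndex, which assigns 0..n-1 in partition order at the cost of an
extra job (rdd stands in for your RDD):

// zipWithIndex assigns consecutive 0..n-1 indices in partition order,
// at the cost of an extra job to count partition sizes
val withIds = rdd.zipWithIndex().map { case (row, id) => (id, row) }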
On 7/22/15 9:03 AM, Ankit wrote:
Thanks a lot Cheng. So it seems even in spark 1.3 and 1.4, parquet
ENUMs were treated as Strings in Spark SQL right? So does this mean
partitioning for enums already works in previous versions too since
they are just treated as strings?
It’s a little bit
I can post multiple items at a time.
Data is being read from Kafka and filtered; after that it is posted.
Does foreachPartition
load the complete partition in memory or use an iterator of batches under
the hood? If the complete batch is not loaded, will using a custom size of
100-200 requests in one batch and post
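Roughly what I have in mind (untested sketch; filteredStream is the filtered
DStream[String] and postBatch stands in for the real HTTP call):

// Placeholder for the real HTTP call posting one batch of events
def postBatch(batch: Seq[String]): Unit = println(s"posting ${batch.size} events")

filteredStream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    // the iterator is consumed lazily, 200 records at a time
    iter.grouped(200).foreach(batch => postBatch(batch))
  }
}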
Would monotonicallyIncreasingId
https://github.com/apache/spark/blob/d4c7a7a3642a74ad40093c96c4bf45a62a470605/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L637
work for you?
Best,
Burak
On Tue, Jul 21, 2015 at 4:55 PM, Srikanth srikanth...@gmail.com wrote:
Hello,
I'm
Hi Rozen,
you can get the current time by calling a Java API and then get rowTimestamp
with SQL:
val currentTimeStamp = System.currentTimeMillis()
val rowTimestamp = sqlContext.sql("select rowTimestamp from tableName")
and then you can write a function like this
def
Either you have to do rdd.collect and then broadcast or you can do a join
On 22 Jul 2015 07:54, Dan Dong dongda...@gmail.com wrote:
Hi, All,
I am trying to access a Map from RDDs that are on different compute nodes,
but without success. The Map is like:
val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3, ...)
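The broadcast route, as an untested sketch (rdd stands in for your RDD of
keys):

val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3)
val bcMap = sc.broadcast(map1)

// every task reads the broadcast copy instead of shipping the Map per record
val matched = rdd.filter(key => bcMap.value.contains(key))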
Thanks a lot Cheng. So it seems even in spark 1.3 and 1.4, parquet ENUMs
were treated as Strings in Spark SQL right? So does this mean partitioning
for enums already works in previous versions too since they are just
treated as strings?
Also, is there a good way to verify that the partitioning is
Hi Lian,
Sorry, I'm new to Spark so I did not express myself very clearly. I'm
concerned about the situation when, let's say, I have a Parquet table with
some partitions and I add a new column A to the Parquet schema and write some
data with the new schema to a new partition in the table. If I'm not
From what I understand about your code, it is getting data from different
partitions of a topic - all data from partition 1, then from partition
2, etc. However, you have configured it to read from just one partition
(topicCount has count = 1). So I am not sure what your intention is: read
all
Is it fair to say that Storm stream processing is completely in memory, whereas
spark streaming would take a disk hit because of how shuffle works?
Does spark streaming try to avoid disk usage out of the box?
-Abhishek-
I'm sorry, I have no idea why it is failing on your side. I have been using
this for a while now and it works fine. All I can say is use version 1.4.0,
but I don't think it is going to make a big difference. This is the one
which I use; a/b are my directories.
Hello everyone,
Does spark thrift server support timeout?
Is there a documentation I can reference for questions like these?
I know it supports cancel, but I am not sure about timeout.
Thanks,
Judy
Yep, I saw that in your previous post and I thought it was a typing mistake
you made while posting; I never imagined that it was done in RStudio.
Glad it worked.
Try instead:
import java.util.Date
import org.apache.spark.sql.functions._
val determineDayPartID = udf((evntStDate: String, evntStHour: String) => {
val stFormat = new java.text.SimpleDateFormat("yyMMdd")
val stDateStr: String = evntStDate.substring(2, 8)
val stDate: Date = stFormat.parse(stDateStr)
val
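And then, as a usage sketch on the DataFrame from your mail (column names
assumed):

val withDayPart = input.withColumn(
  "DayPartID", determineDayPartID(col("evntStDate"), col("evntStHour")))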
This is working!
Thank you so much :)
Stefan Panayotov, PhD
Home: 610-355-0919
Cell: 610-517-5586
email: spanayo...@msn.com
spanayo...@outlook.com
spanayo...@comcast.net
From: mich...@databricks.com
Date: Tue, 21 Jul 2015 12:08:04 -0700
Subject: Re: Add column to DF
To: spanayo...@msn.com
Hi, I am using a custom build of spark 1.4 with the parquet dependency
upgraded to 1.7. I have thrift data encoded with parquet that I want to
partition by a column of type ENUM. Spark programming guide says partition
discovery is only supported for string and numeric columns, so it seems
Hi,
I am trying to add a column to a data frame that I created based on a JSON file
like this:
val input =
hiveCtx.jsonFile(wasb://n...@cmwhdinsightdatastore.blob.core.windows.net/json/*).toDF().persist(StorageLevel.MEMORY_AND_DISK)
I have a function that is generating the values for the new
If you can post multiple items at a time, then use foreachPartition to post
the whole partition in a single request.
On Tue, Jul 21, 2015 at 9:35 AM, Richard Marscher rmarsc...@localytics.com
wrote:
You can certainly create threads in a map transformation. We do this to do
concurrent DB
Hi, All,
I am trying to access a Map from RDDs that are on different compute nodes,
but without success. The Map is like:
val map1 = Map("aa" -> 1, "bb" -> 2, "cc" -> 3, ...)
All RDDs will have to check against it to see if the key is in the Map or
not, so it seems I have to make the Map itself global; the problem
Hi
I am testing Spark on Amazon EMR using Python and the basic wordcount
example shipped with Spark.
After running the application, I realized that in Stage 0 reduceByKey(add),
around 2.5GB shuffle is spilled to memory and 4GB shuffle is spilled to
disk. Since in the wordcount example I am not
Hi Andrew,
Thanks for the advice. I didn't see the log in the NodeManager, so apparently
something was wrong with the yarn-site.xml configuration.
After digging in more, I realized it was a user error. I'm sharing this so
others may know what mistake I made.
When I review
Thanks TD - appreciate the response !
On Jul 21, 2015, at 1:54 PM, Tathagata Das t...@databricks.com wrote:
Most shuffle files are really kept around in the OS's buffer/disk cache, so
it is still pretty much in memory. If you are concerned about performance,
you have to do a holistic
Hi All ,
I am having a class loading issue: the Spark assembly uses Google Guice
internally, and one of the jars I am using depends on
sisu-guice-3.1.0-no_aop.jar. How do I load my classes first so that this
doesn't result in an error, and tell Spark to load its assembly later?
Ashish
Most shuffle files are really kept around in the OS's buffer/disk cache, so
it is still pretty much in memory. If you are concerned about performance,
you have to do a holistic comparison for end-to-end performance. You could
take a look at this.
A few questions about caching a table in Spark SQL.
1) Is there any difference between caching the dataframe and the table?
df.cache() vs sqlContext.cacheTable(tableName)
2) Do you need to warm up the cache before seeing the performance
benefits? Is the cache LRU? Do you need to run some
depends on your data and I guess the time/performance goals you have for
both training/prediction, but for a quick answer : yes :)
2015-07-21 11:22 GMT+02:00 Chintan Bhatt chintanbhatt...@charusat.ac.in:
Which classifier can be useful for mining massive datasets in spark?
Decision Tree can be
I am new to Spark and I understand that Spark divides the executor memory
into the following fractions:
*RDD Storage:* Which Spark uses to store persisted RDDs using .persist() or
.cache() and can be defined by setting spark.storage.memoryFraction (default
0.6)
*Shuffle and aggregation buffers:*
I have a number of CSV files and need to combine them into an RDD by part of
their filenames.
For example, for the below files
$ ls
20140101_1.csv 20140101_3.csv 20140201_2.csv 20140301_1.csv
20140301_3.csv 20140101_2.csv 20140201_1.csv 20140201_3.csv
I need to combine files with names
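Roughly what I had in mind (untested sketch): group the paths on the driver
by the date prefix and build one RDD per group, since textFile accepts a
comma-separated list of paths.

val files = Seq(
  "20140101_1.csv", "20140101_2.csv", "20140101_3.csv",
  "20140201_1.csv", "20140201_2.csv", "20140201_3.csv")

val rddsByDate = files
  .groupBy(_.split("_")(0))                       // key by the date part of the name
  .mapValues(paths => sc.textFile(paths.mkString(",")))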
How do I load a dataset in Apache Spark?
Can anyone point me to sources of massive datasets?
On Wed, Jul 22, 2015 at 4:50 AM, Ron Gonzalez zlgonza...@yahoo.com.invalid
wrote:
I'd use Random Forest. It will give you better generalizability. There
are also a number of things you can do with RF that allows to
Hi Akhil and Jorn,
I tried, as you suggested, to create a simple scenario, but I get an
error on rdd.join(newRDD): value join is not a member of
org.apache.spark.rdd.RDD[twitter4j.Status]. The code looks like:
val stream = TwitterUtils.createStream(ssc, auth)
val filteredStream=
Hi,
Question on using spark sql.
Can someone give an example of creating a table from a directory
containing parquet files in HDFS instead of an actual parquet file?
Thanks,
Ron
On 07/21/2015 01:59 PM, Brandon White wrote:
A few questions about caching a table in Spark SQL.
1) Is there
I'd use Random Forest. It will give you better generalizability. There
are also a number of things you can do with RF that allow you to train on
samples of the massive data set and then just average over the resulting
models...
Thanks,
Ron
On 07/21/2015 02:17 PM, Olivier Girardot wrote:
depends
Parquet support for Thrift/Avro/ProtoBuf ENUM types was just added to
the master branch. https://github.com/apache/spark/pull/7048
ENUM types are actually not in the Parquet format spec; that's why we
didn't have it in the first place. Basically, ENUMs are always treated
as UTF8 strings in
Hey Jerrick,
What do you mean by schema evolution with Hive metastore tables? Hive
doesn't take schema evolution into account. Could you please give a
concrete use case? Are you trying to write Parquet data with extra
columns into an existing metastore Parquet table?
Cheng
On 7/21/15 1:04
https://spark.apache.org/docs/latest/sql-programming-guide.html#loading-data-programmatically
On Tue, Jul 21, 2015 at 4:06 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid
wrote:
Hi,
Question on using spark sql.
Can someone give an example for creating table from a directory
containing
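In short, something like this untested sketch: point the reader at the
directory, not a single file (the path and table name here are made up).

val df = sqlContext.read.parquet("hdfs:///path/to/parquet-dir/")   // directory of parquet files
df.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events").show()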
Thanks, that works a lot better ;)
scala> val results = sqlContext.sql("select movies.title, movierates.maxr,
movierates.minr, movierates.cntu from (SELECT ratings.product,
max(ratings.rating) as maxr, min(ratings.rating) as minr, count(distinct
user) as cntu FROM ratings group by ratings.product )
You can add the jar in the classpath, and you can set the property like:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.sigmoid.com")
Thanks
Best Regards
On Mon, Jul 20, 2015 at 9:41 PM, Schmirr Wurst schmirrwu...@gmail.com
wrote:
Thanks, that is what I was looking for...
Any Idea
I'd suggest upgrading to 1.4 as it has better metrics and UI.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 7:01 PM, Shushant Arora shushantaror...@gmail.com
wrote:
Is coalesce not applicable to kafkaStream? How do I do coalesce on a
Kafka direct stream? It's not there in the API.
Shall calling
Do you have HADOOP_HOME, HADOOP_CONF_DIR and hadoop's winutils.exe in the
environment?
Thanks
Best Regards
On Mon, Jul 20, 2015 at 5:45 PM, nitinkalra2000 nitinkalra2...@gmail.com
wrote:
Hi All,
I am working on Spark 1.4 on windows environment. I have to set eventLog
directory so that I can
I also ran into a similar use case. Is this possible?
On 15 July 2015 at 18:12, Michel Hubert mich...@vsnsystemen.nl wrote:
Hi,
I want to implement a time-out mechanism in the updateStateByKey(...)
routine.
But is there a way to retrieve the time of the start of the batch
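What we are after is roughly this (untested sketch; the state class, the
keyed DStream `pairs`, and the 10-minute threshold are all made up): keep a
last-updated timestamp in the state and return None once a key has been idle
too long.

case class SessionState(total: Long, lastUpdated: Long)

val timeoutMs = 10 * 60 * 1000L

val updated = pairs.updateStateByKey[SessionState] {
  (newValues: Seq[Long], state: Option[SessionState]) =>
    val now = System.currentTimeMillis()
    if (newValues.nonEmpty)
      Some(SessionState(newValues.sum + state.map(_.total).getOrElse(0L), now))
    else if (state.exists(s => now - s.lastUpdated > timeoutMs))
      None   // key has been idle too long: drop it from the state
    else
      state
}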
That works! Thanks.
Can I ask you one further question?
How does Spark SQL support insertion?
That is to say, if I did:
sqlContext.sql("insert into newStu values ('10', 'a', 1)")
the error is:
failure: ``table'' expected but identifier newStu found
insert into newStu values ('10', aa, 1)
but if I did:
Hi
Can I create user threads in executors?
I have a streaming app where, after processing, I have a requirement to push
events to an external system. Each post request costs ~90-100 ms.
To make the posts parallel, I cannot use the same thread because that is
limited by the number of cores available in the system. Can I
Hi Akhil,
I don't have HADOOP_HOME or HADOOP_CONF_DIR, or even winutils.exe. What's
the configuration required for this? Where can I get winutils.exe?
Thanks and Regards,
Nitin Kalra
On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Do you have HADOOP_HOME,
To my knowledge, there is no HA for External Shuffle Service.
Cheers
On Jul 21, 2015, at 2:16 AM, JoneZhang joyoungzh...@gmail.com wrote:
There is a saying If the service is enabled, Spark executors will fetch
shuffle files from the service instead of from each other. in the wiki
Which version do you have ?
- I tried with Spark 1.4.1 for Hadoop 2.6, but there I had an issue that
the AWS module is not there somehow:
java.io.IOException: No FileSystem for scheme: s3n
the same for s3a:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class
Hi ,
I am trying to use LinearRegressionWithSGD on the Million Song Data Set and
my model returns NaNs as weights and 0.0 as the intercept. What might
be the cause of the error? I am using Spark 1.4.0 in standalone mode.
Below is my model:
val numIterations = 100
val stepSize = 1.0
val
Hi all: I am a bit confused about the role of the driver. In our production
environment, we have a Spark Streaming app running in standalone mode. What
we are concerned about is: if the driver shuts down accidentally (host
shutdown or whatever), would the app keep running normally?
Any explanation would be appreciated!!
Jack,
You can refer the hive sql syntax if you use HiveContext:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML
Thanks!
-Terry
That works! Thanks.
Can I ask you one further question?
How did spark sql support insertion?
That is say, if I did:
sqlContext.sql(insert
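If you'd rather stay with the DataFrame API, an untested alternative
(assuming newStu is an insertable table, e.g. Hive- or parquet-backed, and
the column names are made up) is:

import org.apache.spark.sql.SaveMode
import sqlContext.implicits._

// Column names are assumed; adjust to the actual schema of newStu
val newRows = Seq(("10", "aa", 1)).toDF("id", "name", "score")
newRows.write.mode(SaveMode.Append).insertInto("newStu")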
Hi,
I am trying to start a driver app as a daemon using Linux' start-stop-daemon
script (I need console detaching, unbuffered STDOUT/STDERR to logfile and
start/stop using a PID file).
I am doing this like this (which works great for the other apps we have)
/sbin/start-stop-daemon -c $USER
Hi everyone,
I have pretty challenging problem with reading/writing multiple parquet
files with streaming, but let me introduce my data flow:
I have a lot of json events streaming to my platform. All of them have the
same structure, but fields are mostly optional. Some of the fields are
arrays
Great, and that file exists on HDFS and is world readable? just
double-checking.
What classpath is this -- your driver or executor? this is the driver, no?
I assume so just because it looks like it references the assembly you built
locally and from which you're launching the driver.
I think
Guys
Any help would be great!!
I am trying to use a DF and save it to Elasticsearch using the newHadoopApi
(because I am using Python). Can anyone tell me if this is even
possible? I have managed to use df.rdd to complete the ES integration, but I
would prefer df.write. Is it possible or upcoming?
Hi Cody,
Thanks for your answer. I'm with Spark 1.3.0. I don't quite understand how
to use the messageHandler parameter/function in the createDirectStream
method. You are referring to this, aren't you ?
def createDirectStream[K: ClassTag, V: ClassTag, KD <: Decoder[K]: ClassTag,
VD <:
I may have found the answer in the SqlParser.scala file.
It looks like the syntax Spark uses for insert is different from what we
normally use for MySQL.
I hope someone can confirm this. I would also appreciate it if there is a SQL
reference list available.
Sent from my iPhone
On 21 Jul 2015, at
No, I did not use HiveContext at this stage.
I am talking about the embedded SQL syntax of pure Spark SQL.
Thanks, mate.
On 21 Jul 2015, at 6:13 pm, Terry Hole
hujie.ea...@gmail.com wrote:
Jack,
You can refer the hive sql syntax if you use HiveContext:
Thanks for this information, but I use Cloudera Live 5.4.4, and that has
only Spark 1.3; a newer version is not available.
I don't understand this problem: first it computes some iterations and then
it stops and does nothing. I think the problem
is not in the program code.
Maybe you know a
Here are two ways of doing that:
Without the filter function:
JavaPairDStream<String, String> foo =
ssc.<String, String, SequenceFileInputFormat>fileStream("/tmp/foo");
With the filter function:
JavaPairInputDStream<LongWritable, Text> foo = ssc.fileStream("/tmp/foo",
LongWritable.class,
Consider the following code:
val df = Seq((1, 3), (2, 3)).toDF("key", "value").registerTempTable("tbl")
sqlContext.sql("select key, null as value from tbl")
.write.format("json").mode(SaveMode.Overwrite).save("test")
sqlContext.read.format("json").load("test").printSchema()
It shows:
root
|-- key: long
Did you try with s3a? It seems its more like an issue with hadoop.
Thanks
Best Regards
On Tue, Jul 21, 2015 at 2:31 PM, Schmirr Wurst schmirrwu...@gmail.com
wrote:
It seems to work for the credentials , but the endpoint is ignored.. :
I've changed it to
Which classifier is useful for mining massive datasets in Spark?
Is a Decision Tree a good choice in terms of scalability?
--
CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/
Assistant Professor,
U P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of
I might add to this that I've done the same exercise on Linux (CentOS 6) and
there, broadcast variables ARE working. Is this functionality perhaps not
exposed on Mac OS X? Or has it to do with the fact there are no native
Hadoop libs for Mac?
Here are some resources which will help you with that.
-
http://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
- https://issues.apache.org/jira/browse/SPARK-2356
Thanks
Best Regards
On Tue, Jul 21, 2015 at 1:57 PM, Nitin Kalra
Hi, I want to ask about an issue I have faced while using Spark. I load
dataframes from parquet files. Some dataframes' parquet files have lots of
partitions and 10 million rows.
Running a "where id = x" query on the dataframe scans all partitions. When
saving to rdd object/parquet there is a partition column. The
It seems to work for the credentials, but the endpoint is ignored:
I've changed it to sc.hadoopConfiguration.set("fs.s3n.endpoint", "test.com")
And I continue to get my data from amazon, how could it be ? (I also
use s3n in my text url)
2015-07-21 9:30 GMT+02:00 Akhil Das
There is a saying If the service is enabled, Spark executors will fetch
shuffle files from the service instead of from each other. in the wiki
https://spark.apache.org/docs/1.3.0/job-scheduling.html#graceful-decommission-of-executors
Please see the tail of:
https://issues.apache.org/jira/browse/SPARK-2356
On Jul 21, 2015, at 1:27 AM, Nitin Kalra nitinkalra2...@gmail.com wrote:
Hi Akhil,
I don't have HADOOP_HOME or HADOOP_CONF_DIR and even winutils.exe ? What's
the configuration required for this ? From where can I
Maybe you can try: spark-submit --class sparkwithscala.SqlApp --jars
/home/lib/mysql-connector-java-5.1.34.jar --master spark://hadoop1:7077
/home/myjar.jar
Thanks!
-Terry
Hi there,
I would like to use spark to access the data in mysql. So firstly I tried
to run the program using:
Hello,
*TL;DR: task crashes with OOM, but application gets stuck in infinite loop
retrying the task over and over again instead of failing fast.*
Using Spark 1.4.0, standalone, with DataFrames on Java 7.
I have an application that does some aggregations. I played around with
shuffling settings,
I have a spark program that uses dataframes to query hive and I run it both
as a spark-shell for exploration and I have a runner class that executes
some tasks with spark-submit. I used to run against 1.4.0-SNAPSHOT. Since
then 1.4.0 and 1.4.1 were released so I tried to switch to the official
Hi,
I have log4j.xml in my jar.
From 1.4.1 it seems that log4j.properties in spark/conf comes first on the
classpath, so spark/conf/log4j.properties wins.
Before that (in v1.3.0) the log4j.xml bundled in the jar defined the
configuration. If I manually add my jar to be strictly first in the classpath (by
It could be a GC pause or something, you need to check in the stages tab
and see what is taking time, If you upgrade to Spark 1.4, it has better UI
and DAG visualization which helps you debug better.
Thanks
Best Regards
On Mon, Jul 20, 2015 at 8:21 PM, Pa Rö paul.roewer1...@googlemail.com
wrote:
Yes, I imagine it's the driver's classpath - I'm pulling those screenshots
straight from the Spark UI environment page. Is there somewhere else to
grab the executor class path?
Also, the warning is only printed once, so it's also not clear whether the
warning is from the driver or the executor,