Hi,
Nope. I was trying to use EC2 (due to a few constraints) where I faced the
problem.
With EMR, it works flawlessly.
But I would like to go back and use EC2 if I can fix this issue.
Has anybody set up a Spark cluster using plain EC2 machines? What steps did
you follow?
Thanks and Regards,
Hi,
Are you using EMR ?
Natu
On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH wrote:
> Hi Ankur,
> Thanks for the reply.
> This is already done.
> If I wait for a long amount of time(10 minutes), a few tasks get
> successful even on slave nodes. Sometime, a fraction of the
Hi Ankur,
Thanks for the reply.
This is already done.
If I wait for a long time (10 minutes), a few tasks succeed even on slave
nodes. Sometimes a fraction of the tasks (20%) are completed on all the
machines in the initial 5 seconds, and then it slows down drastically.
Thanks and
Hi all,
I am trying to learn about UDAFs and implemented a simple reservoir sample
UDAF. It's working fine. However, I am not able to figure out what DataType
I should use so that it can deal with all DataTypes (simple and complex).
For instance, currently I have defined my input schema as
def
Setting it programmatically doesn't work either:
sparkConf.setIfMissing("class", "...Main")
In my current setting, moving main to another package requires propagating
the change to the deploy scripts. It doesn't matter; I will find some other way. Petr
On Fri, Sep 25, 2015 at 4:40 PM, Petr Novak
I am seeing a similar issue when reading from Kafka. I have a single Kafka
broker with 1 topic and 10 partitions on a separate machine. I have a
three-node spark cluster, and verified that all workers are registered with
the master. I'm initializing Kafka using a similar method to this article:
Hello,
It worked like a charm. Thank you very much.
Some userids were null; that's why many records went to userid 'null'. When I added
a where clause, userid != 'null', it solved the problem.
> On 24 Sep 2015, at 22:43, java8964 wrote:
>
> I can understand why your first query
Good catch, I was not aware of this setting.
I’m wondering though if it also generates a shuffle or if the data is still
processed by the node on which it’s ingested - so that you’re not gated by the
number of cores on one machine.
-adrian
On 9/25/15, 5:27 PM, "Silvio Fiorito"
Storing passbacks transactionally with results in your own data store, with
a schema that makes sense for you, is the optimal solution.
On Fri, Sep 25, 2015 at 11:05 AM, Radu Brumariu wrote:
> Right, I understand why the exceptions happen.
> However, it seems less useful to
Right, I understand why the exceptions happen.
However, it seems less useful to have checkpointing that only works in
the case of an application restart. IMO, code changes happen quite often,
and not being able to pick up where the previous job left off is quite a
bit of a hindrance.
The
Otherwise it seems it tries to load from a checkpoint which I have deleted
and which cannot be found. Or it should work and I have something else wrong.
The documentation doesn't mention the option of using the jar manifest, so I assume it
doesn't work this way.
Many thanks,
Petr
Hi Utkarsh,
What is your job placement like when you run in fine-grained mode? You said
coarse-grained mode only ran with one node, right?
And when the job is running could you open the Spark webui and get stats
about the heap size and other java settings?
Tim
On Thu, Sep 24, 2015 at 10:56 PM, Utkarsh
It seems that is due to the Spark SPARK_LOCAL_IP setting.
export SPARK_LOCAL_IP=localhost
will not work.
Then, how should it be set?
Thank you all~~
On Friday, September 25, 2015 5:57 PM, Zhiliang Zhu
wrote:
Hi Steve,
Thanks a lot for your reply.
That
Hello,
I'm trying to start the SparkSQL thriftserver over YARN, connecting it to
the 1.1.0-cdh5.4.3 hive metastore that we already have in production.
I downloaded the latest version of Spark (1.5.0), I just followed the
instructions from the documentation
Hi,
We have an application that submits several thousand jobs within the same
SparkContext, using a thread pool to run about 50 in parallel. We're
running on YARN using Spark 1.4.1 and seeing a problem where our driver is
killed by YARN due to running beyond physical memory limits (no Java OOM
I tried but I'm getting the same error (task not serializable)
> On 25 בספט׳ 2015, at 20:10, Ted Yu wrote:
>
> Is the Schema.parse() call expensive ?
>
> Can you call it in the closure ?
>
>> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv
>>
Hi Akhil,
I do have 25 partitions being created. I have set
the spark.default.parallelism property to 25. Batch size is 30 seconds and
block interval is 1200 ms which also gives us roughly 25 partitions from
the input stream. I can see 25 partitions being created and used in the
Spark UI also.
Seems like you have "hive.server2.enable.doAs" enabled; you can either
disable it, or configure hs2 so that the user running the service
("hadoop" in your case) can impersonate others.
See:
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html
On Fri, Sep 25,
It looks like the distance metric is hard-coded to the L2 norm (Euclidean
distance) in MLlib. As you may expect, you are not the first person to
desire other metrics and there has been some prior effort.
Please reference this PR: https://github.com/apache/spark/pull/2634
And corresponding JIRA:
Is it possible to use other distance metrics than Euclidean (e.g. Tanimoto,
Manhattan) with MLlib KMeans?
Hi,
We are running Spark 1.3 on CDH 5.4.1 on top of YARN. We want to know how
to control the task timeout when a node fails so that tasks running on it are
restarted on another node. At present, the job waits approximately 10 minutes to
restart the tasks that were running on the failed node.
Yes, the partition IDs are the same.
As far as the failure / subclassing goes, you may want to keep an eye on
https://issues.apache.org/jira/browse/SPARK-10320 , not sure if the
suggestions in there will end up going anywhere.
On Fri, Sep 25, 2015 at 3:01 PM, Neelesh wrote:
Thanks. I'll keep an eye on this. Our implementation of the DStream
basically accepts a function to compute current offsets. The implementation
of the function fetches the list of topics from ZooKeeper once in a while. It
then adds consumer offsets for newly added topics to the currentOffsets
that's in
Hi, Spark Users:
I have a problem where Spark cannot recognize the string type in the
Parquet schema generated by Hive.
Versions of all components:
Spark 1.3.1, Hive 0.12.0, Parquet 1.3.2
I generated a detailed low-level table in the Parquet format using MapReduce Java
code. This table can be read
Please set the SQL option spark.sql.parquet.binaryAsString to true
when reading Parquet files containing strings generated by Hive.
This is actually a bug in parquet-hive. When generating the Parquet schema
for a string field, Parquet requires a "UTF8" annotation, something like:
message
BTW, just checked that this bug should have been fixed since Hive
0.14.0. So the SQL option I mentioned is mostly used for reading legacy
Parquet files generated by older versions of Hive.
Cheng
On 9/25/15 2:42 PM, Cheng Lian wrote:
Please set the SQL option
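For reference, a minimal sketch of turning that option on before reading such files; it assumes Spark 1.4+ DataFrameReader and a hypothetical table path:

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

val sc = new SparkContext("local[*]", "binary-as-string-example")
val sqlContext = new SQLContext(sc)

// Treat Parquet BINARY columns without a UTF8 annotation as strings,
// which is what Parquet files written by older Hive versions need.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Hypothetical path to the Hive-generated Parquet table.
val df = sqlContext.read.parquet("/user/hive/warehouse/detail_table")
df.printSchema()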
Hi folks,
I have a set of categorical columns (strings) that I'm parsing and
converting into Vectors of features to pass to an MLlib classifier (random
forest).
In my input data, some columns have null values. Say, in one of those
columns, I have p values + a null value :
How should I build my
Looking at this further, it appears that my Spark Context is not correctly
setting the Master name. I see the following in logs:
15/09/25 16:45:42 INFO DriverRunner: Launch Command:
"/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java" "-cp"
Hi All,
I am following
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started?
to set up Hive on Spark. After setup/configuration everything starts up and I am
able to show tables, but when executing a SQL statement within beeline I get an
error. Please help and
Thanks Petr, Cody. This is a reasonable place to start for me. What I'm
trying to achieve
stream.foreachRDD { rdd =>
  rdd.foreachPartition { p =>
    Try(myFunc(...)) match {
      case Success(s) => // update watermark for this partition; of course, the
      // expectation is that it will work only if
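For reference, a fuller but still minimal sketch of that pattern against the direct Kafka stream (0.8 integration), where myFunc, updateWatermark, the broker address and topic name are hypothetical stand-ins:

import scala.util.{Failure, Success, Try}
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object PerPartitionCommit {
  // Hypothetical stand-ins for the real processing and offset store.
  def myFunc(records: Iterator[(String, String)]): Unit = records.foreach(_ => ())
  def updateWatermark(range: OffsetRange): Unit =
    println(s"committing ${range.topic}/${range.partition} up to ${range.untilOffset}")

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("per-partition-commit"), Seconds(30))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Grab the ranges before any shuffle; partition i of the RDD maps to offsetRanges(i).
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { partition =>
        val range = offsetRanges(TaskContext.get.partitionId)
        Try(myFunc(partition)) match {
          case Success(_) => updateWatermark(range) // only advance on success
          case Failure(e) => println(s"leaving watermark for $range untouched: $e")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}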
Wouldn't the same case be made for checkpointing in general ?
What I am trying to say, is that this particular situation is part of the
general checkpointing use case, not an edge case.
I would like to understand why shouldn't the checkpointing mechanism,
already existent in Spark, handle this
For the 1-1 mapping case, can I use TaskContext.get().partitionId as an
index in to the offset ranges?
For the failure case, yes, I'm subclassing of DirectKafkaInputDStream. As
for failures, different partitions in the same batch may be talking to
different RDBMS servers due to multitenancy - a
Eventually I'd like to eliminate HiveContext, but for now I just recommend
that most users use it instead of SQLContext.
On Thu, Sep 24, 2015 at 5:41 PM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:
> Thanks Michael. Just want to check if there is a roadmap to include Hive
>
I think I found the problem.
Have to change the yarn capacity scheduler to use
DominantResourceCalculator
Thanks!
On Fri, Sep 25, 2015 at 4:54 AM, Akhil Das
wrote:
> Which version of spark are you having? Can you also check whats set in
> your
import org.apache.spark.mllib.linalg._
val v = Vectors.dense(1.0,2.0)
val rdd = sc.parallelize(v.toArray)
On Fri, Sep 25, 2015 at 2:46 PM, Yusuf Can Gürkan
wrote:
> How can i convert a Vector to RDD[Double]. For example:
>
> val vector = Vectors.dense(1.0,2.0)
> val
Your success case will work fine, it is a 1-1 mapping as you said.
To handle failures in exactly the way you describe, you'd need to subclass
or modify DirectKafkaInputDStream and change the way compute() works.
Unless you really are going to have very fine-grained failures (why would
only a
As Cody says, to achieve true exactly-once, the bookkeeping has to happen
in the sink data system, and that too assuming it's a transactional store.
Wherever possible, we try to make the application idempotent (upsert in
HBase, ignore-on-duplicate for MySQL etc), but there are still cases
(analytics,
Can someone please help, either by explaining or by pointing to documentation,
with the relationship between the number of executors needed and how to let the concurrent
jobs that are created by the above parameter run in parallel?
On Thu, Sep 24, 2015 at 11:56 PM, Atul Kulkarni
wrote:
> Hi
Hi everyone,
I have an RDD of the format (user: String, timestamp: Long, state:
Boolean). My task involves converting the states, where on/off is
represented as true/false, into intervals of 'on' of the format (beginTs:
Long, endTs: Long). So this task requires me, per user, to line up all of
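A minimal sketch of one way to do this, assuming each user's event list fits in memory (group per user, sort by timestamp, then pair each "on" with the following "off"); the sample data is made up:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("on-off-intervals").setMaster("local[*]"))

// (user, timestamp, state) where state == true means "on"
val events = sc.parallelize(Seq(
  ("alice", 1L, true), ("alice", 5L, false),
  ("alice", 9L, true), ("alice", 12L, false),
  ("bob", 2L, true), ("bob", 7L, false)
))

// Per user: sort by timestamp, then pair each on-event with the next off-event
// into a (beginTs, endTs) interval.
val intervals = events
  .map { case (user, ts, state) => (user, (ts, state)) }
  .groupByKey()
  .mapValues { evs =>
    val sorted = evs.toSeq.sortBy(_._1)
    sorted.sliding(2).collect {
      case Seq((begin, true), (end, false)) => (begin, end)
    }.toSeq
  }

intervals.collect().foreach(println)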
Spark's checkpointing system is not a transactional database, and it
doesn't really make sense to try and turn it into one.
On Fri, Sep 25, 2015 at 2:15 PM, Radu Brumariu wrote:
> Wouldn't the same case be made for checkpointing in general ?
> What I am trying to say, is that
Hi,
I'm getting a NotSerializableException even though I'm creating all my
objects from within the closure:
import org.apache.avro.generic.GenericDatumReader
import java.io.File
import org.apache.avro._
val orig_schema = Schema.parse(new File("/home/wasabi/schema"))
val READER = new
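One common workaround (a sketch, not necessarily the fix for this particular error) is to build the Schema and reader inside mapPartitions so they are constructed on the executors and never serialized from the driver; it assumes the schema file path above is readable from every executor and the input is a hypothetical RDD of Avro-encoded byte arrays:

import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("avro-per-partition").setMaster("local[*]"))

// Hypothetical input of serialized Avro payloads.
val payloads = sc.parallelize(Seq.empty[Array[Byte]])

val decoded = payloads.mapPartitions { iter =>
  // Built once per partition, on the executor, so neither the Schema
  // nor the reader needs to be serialized.
  val schema = new Schema.Parser().parse(new File("/home/wasabi/schema"))
  val reader = new GenericDatumReader[GenericRecord](schema)
  iter.map { bytes =>
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).toString
  }
}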
I'm sorry. Both approaches actually work. It was something else wrong with
my cluster. Petr
On Fri, Sep 25, 2015 at 4:53 PM, Petr Novak wrote:
> Either setting it programatically doesn't work:
> sparkConf.setIfMissing("class", "...Main")
>
> In my current setting moving
Yes, you are right. Made the change and also linked hive-site.xml into the Spark conf
directory. Reran the SQL; getting this error in hive.log:
2015-09-25 13:31:14,750 INFO [HiveServer2-Handler-Pool: Thread-125]:
client.SparkClientImpl (SparkClientImpl.java:startDriver(375)) - Attempting
impersonation of
Thanks for the clarification. Could you please provide the full schema
of your table and query plans of your query? You may obtain them via:
hiveContext.table("your_table").printSchema()
and
hiveContext.sql("your query").explain(extended = true)
You also mentioned "Thrift" in the subject,
Hi,
We are using DirectKafkaInputDStream and store completed consumer
offsets in Kafka (0.8.2). However, some of our use cases require that
offsets not be written if processing of a partition fails with certain
exceptions. This allows us to build various backoff strategies for that
partition,
You can have offsetRanges on workers, e.g.:
object Something {
  var offsetRanges = Array[OffsetRange]()
  def create[F : ClassTag](stream: InputDStream[Array[Byte]])
                          (implicit codec: Codec[F]): DStream[F] = {
    stream transform { rdd =>
      offsetRanges =
How can I convert a Vector to RDD[Double]? For example:
val vector = Vectors.dense(1.0,2.0)
val rdd // I need sc.parallelize(Array(1.0,2.0))
In spark-defaults.conf the spark.master is spark://hostname:7077. From
hive-site.xml
spark.master
hostname
From: Jimmy Xiang [mailto:jxi...@cloudera.com]
Sent: Friday, September 25, 2015 1:00 PM
To: Garry Chen
Cc: user@spark.apache.org
Subject: Re: hive on
Is the Schema.parse() call expensive ?
Can you call it in the closure ?
On Fri, Sep 25, 2015 at 10:06 AM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I'm getting a NotSerializableException even though I'm creating all the my
> objects from within the closure:
> import
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen wrote:
> In spark-defaults.conf the spark.master is spark://hostname:7077. From
> hive-site.xml
> spark.master
> hostname
>
That's not a valid value for spark.master (as the error indicates).
You should set it to
http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers
also has an example of how to close over the offset ranges so they are
available on executors.
On Fri, Sep 25, 2015 at 12:50 PM, Neelesh wrote:
> Hi,
>We are
Hi Ritesh,
Right now, we only allow specific data types defined in the inputSchema.
Supporting abstract types (e.g. NumericType) may cause the logic of a UDAF to
be more complex. It would be great to understand the use cases first. What
kinds of possible input data types that you want to support and
hi Yin,
I have written a simple UDAF to generate N samples for each group. I am
using the reservoir sampling algorithm for this. In this case the input
data type doesn't matter, as I am not doing any kind of processing on the
input data but just selecting elements at random and building an array
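For context, a minimal sketch of what such a UDAF can look like with the current API. Note the input has to be pinned to one concrete DataType (StringType here), which is exactly the limitation being discussed; the merge step is deliberately simplified rather than a statistically exact reservoir merge:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class ReservoirSample(k: Int) extends UserDefinedAggregateFunction {
  // Input type is fixed to StringType; abstract or arbitrary types are not allowed here.
  def inputSchema: StructType = StructType(StructField("value", StringType) :: Nil)
  def bufferSchema: StructType = StructType(
    StructField("seen", LongType) ::
    StructField("sample", ArrayType(StringType)) :: Nil)
  def dataType: DataType = ArrayType(StringType)
  def deterministic: Boolean = false

  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = Seq.empty[String]
  }

  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val seen = buffer.getLong(0) + 1
    val sample = buffer.getSeq[String](1)
    val v = input.getString(0)
    buffer(0) = seen
    buffer(1) =
      if (sample.size < k) sample :+ v
      else {
        // Algorithm R: replace a random slot with decreasing probability.
        val j = scala.util.Random.nextInt(seen.toInt)
        if (j < k) sample.updated(j, v) else sample
      }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    // Simplified (approximate) merge: concatenate and truncate to k.
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = (buffer1.getSeq[String](1) ++ buffer2.getSeq[String](1)).take(k)
  }

  def evaluate(buffer: Row): Any = buffer.getSeq[String](1)
}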
So I have a large data structure that I want to broadcast to my executors. It
is so large that it makes sense to share access to the object between multiple
tasks, so I create my executors with multiple cores. Unfortunately, it looks
like the object is not shared between threads, but is copied
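For reference, a minimal sketch of the standard broadcast approach, where each executor process holds one read-only deserialized copy that its task threads share; the lookup map and its size are made up:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("broadcast-example").setMaster("local[4]"))

// Hypothetical large read-only lookup structure built on the driver.
val lookup: Map[Int, String] = (1 to 1000000).map(i => i -> s"value-$i").toMap

// broadcast() ships one copy per executor; tasks on the same executor
// read the same cached object rather than each receiving a closure copy.
val lookupBc = sc.broadcast(lookup)

val enriched = sc.parallelize(1 to 100)
  .map(id => (id, lookupBc.value.getOrElse(id, "missing")))

enriched.take(5).foreach(println)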
I am trying to run a query on a month of data. The volume of data is not
much, but we have a partition per hour and per day. The table schema is
heavily nested with a total of 300 leaf fields. I am trying to run a simple
select count(*) query on the table and running into this exception:
SELECT
Hi,
I am new to Spark and GraphX, so thanks in advance for your patience. I want
to create a graph with multiple node attributes. Here is my code:
But I receive an error:
Can someone help? Thanks!
Hello, All,
I found that the examples for JDBC connections mostly read the whole table
and then do operations like joining.
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" -> "schema.tablename")).load()
Sometimes it is not practical
Hi sparkers,
Does anyone know how to do LATERAL VIEW EXPLODE without HiveContext?
I don't want to start up a metastore and derby just because I need LATERAL
VIEW EXPLODE.
I have been trying but I always get the exception like this:
Name: java.lang.RuntimeException
Message: [1.68] failure: ``union''
I have been trying to start up sparkR but always get the error about monitorPort:
export LD_LIBRARY_PATH=$HOME/jaguar/lib:$HOME/opt/hadoop/lib/native
JDBCJAR=$HOME/jaguar/lib/jaguar-jdbc-2.0.jar
sparkR \
  --driver-class-path $JDBCJAR \
  --driver-library-path $HOME/jaguar/lib \
  --conf
In your dbtable you can insert "select ..." instead of the table name. I never
tried it, but I saw an example on the web. Best regards, Jonathan
From: Cui Lin
To: user
Sent: Friday, September 25, 2015 4:12 PM
Subject: Spark for Oracle sample
Hi All,
I would like to submit a Spark job from another remote machine outside the
cluster. I also copied the hadoop/spark conf files onto the remote machine; a
Hadoop job can be submitted, but a Spark job cannot.
In spark-env.sh, it may be due to SPARK_LOCAL_IP not being properly set, or
The SQL parser without HiveContext is really simple, which is why I
generally recommend users use HiveContext. However, you can do it with
dataframes:
import org.apache.spark.sql.functions._
table("purchases").select(explode(df("purchase_items")).as("item"))
On Fri, Sep 25, 2015 at 4:21 PM,
In most cases predicates that you add to jdbcDF will be pushed down into
Oracle, preventing the whole table from being sent over.
df.where("column = 1")
Another common pattern is to save the table to Parquet or something for
repeated querying.
Michael
On Fri, Sep 25, 2015 at 3:13 PM, Cui Lin
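For illustration, a hedged sketch combining the two suggestions above: a predicate that typically gets pushed down, and a subquery standing in for the table name (the URL, column names, subquery and output path are placeholders; the subquery usually needs an alias):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("jdbc-pushdown").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql:dbserver",
  // A parenthesized subquery can replace the table name.
  "dbtable" -> "(select id, amount from schema.tablename where created > date '2015-09-01') t"
)).load()

// Simple predicates added here are typically pushed down to the database.
val filtered = jdbcDF.where("amount > 100")

// Optionally persist locally for repeated querying.
filtered.write.parquet("/tmp/tablename_parquet")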
Print out your env variables and check first
Sent from my iPhone
> On Sep 25, 2015, at 18:43, Zhiliang Zhu wrote:
>
> Hi All,
>
> I would like to submit spark job on some another remote machine outside the
> cluster,
> I also copied hadoop/spark conf files under
Hi Yue,
Thanks very much for your kind reply.
I would like to submit a Spark job remotely from another machine outside the
cluster, and the job will run on YARN, similar to how a Hadoop job is already done.
Could you confirm it would work the same way for Spark...
Do you mean that I should print those variables
Hi All,
I was wondering what the Input Size in the Application UI means.
For my 3-node Cassandra cluster with a 3-node Spark cluster this size is 32GB.
For my 15-node Cassandra cluster with a 15-node Spark cluster this size reaches
172GB.
Though the data in both clusters is about the same volume.
Your exception handling is occurring on the driver, where you
'configure' the job. I don't think it's what you mean to do. You
probably mean to do this within a function you are executing on data
within the cluster, like mapPartitions etc.
On Fri, Sep 25, 2015 at 5:20 AM, Samya
It’s really hard to answer this, as the comparison is not really fair – Storm
is much lower level than Spark and has less overhead when dealing with
stateless operations. I’d be curious how your colleague is implementing the
average on a “batch” and what the Storm equivalent of a batch is.
Hi Radu,
The problem itself is not checkpointing the data – if your operations are
stateless then you are only checkpointing the Kafka offsets, you are right.
The problem is that you are also checkpointing metadata – including the actual
code and serialized Java classes – that’s why you’ll see
Thanks all for the feedback so far.
I haven't decided which external storage will be used yet.
HBase is cool but it requires Hadoop in production. I only have 3-4 servers
for the whole thing (I am thinking of a relational database for this; it can
be MariaDB, MemSQL or MySQL) but they are hard to
Hi all,
We're using SQLContext.read.json to read in a stream of JSON datasets, but
sometimes the inferred schema contains a LongType for the same value, and
sometimes a DoubleType.
This obviously causes problems with merging the schema, so does anyone know a
way of forcing the inferred
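One way to avoid the Long/Double flip-flop is to skip inference and pass an explicit schema to the reader; a minimal sketch, with hypothetical field names and path:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sc = new SparkContext(new SparkConf().setAppName("json-schema").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Declare the ambiguous field as DoubleType up front so every batch agrees.
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("value", DoubleType) // hypothetical field that sometimes infers as Long
))

val df = sqlContext.read.schema(schema).json("/data/stream/batch-0001.json")
df.printSchema()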
Hello all,
I have a Spark streaming application that reads from a Flume Stream, does
quite a few maps/filters in addition to a few reduceByKeyAndWindow and join
operations before writing the analyzed output to ElasticSearch inside a
foreachRDD()...
I recently started to run this on a 2 node
Hi,
I am trying to submit a job on Spark 1.4 (with Spark Master):
bin/spark-submit --master spark://:7077 --driver-memory 4g
--executor-memory 4G --executor-cores 4 --num-executors 1
spark/examples/src/main/python/pi.py 6
which returns:
ERROR SparkDeploySchedulerBackend: Asked to remove
Faced with an issue where Spark temp files get filled under
/opt/spark-1.2.1/tmp on the local filesystem on the worker nodes. Which
parameter/configuration sets that location?
The spark-env.sh file has the folders set as,
export SPARK_HOME=/opt/spark-1.2.1
export SPARK_WORKER_DIR=$SPARK_HOME/tmp
Which version of Spark are you using? Can you also check what's set in your
conf/spark-defaults.conf file?
Thanks
Best Regards
On Fri, Sep 25, 2015 at 1:58 AM, Gavin Yue wrote:
> Running Spark app over Yarn 2.7
>
> Here is my sparksubmit setting:
> --master yarn-cluster
What do you mean by being behind a NAT? Does it mean you are submitting your
jobs to a remote Spark cluster from your local machine? If that's the case
then you need to take care of a few ports (in the NAT):
http://spark.apache.org/docs/latest/configuration.html#networking which
assume random as
hi all,
when using the JavaSparkSQL example, the code was submitted many times as follows:
/home/spark/software/spark/bin/spark-submit --deploy-mode cluster --class
org.apache.spark.examples.sql.JavaSparkSQL
hdfs://SuningHadoop2/user/spark/lib/spark-examples-1.4.0-hadoop2.4.0.jar
unfortunately,
https://issues.apache.org/jira/browse/SPARK-10832
From: our...@cnsuning.com
Sent: 2015-09-25 20:36
To: user
Cc: 494165115
Subject: sometimes No event logs found for application using same JavaSparkSQL
example
hi all,
when using the JavaSparkSQL example, the code was submitted many times as
Hello all,
I have a question regarding the pipelining and parallelism of transformations
in Spark. I couldn’t find any documentation about it and I would really
appreciate your help if you could help me with it. I just started using and
reading Spark, so I guess my description may not be very
The problem turned out to be too high a 'spark.default.parallelism':
BinaryClassificationMetrics does a combineByKey which internally shuffles the
training dataset. Lower parallelism plus cutting the training set RDD history with
a save/read into Parquet solved the problem. Thanks for the hint!
On Wed, Sep 23, 2015 at
Hi all,
The Spark job will run on YARN. Whether I do not set SPARK_LOCAL_IP at all, or
set it as: export SPARK_LOCAL_IP=localhost  # or set it to the specific node IP
in the specific Spark install directory,
it will work well to submit the Spark job on the master node of the cluster; however, it
will fail by
Parallel tasks totally depend on the # of partitions that you have;
if you are not receiving sufficient partitions (partitions > total # of cores)
then try to do a .repartition.
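For instance, a minimal sketch (the source, batch interval and partition count are placeholders to tune against your total cores):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(
  new SparkConf().setAppName("repartition-example").setMaster("local[4]"), Seconds(30))

// Placeholder source; in the real job this would be the Flume/Kafka stream.
val stream = ssc.socketTextStream("localhost", 9999)

// Shuffle the received blocks into 25 partitions so downstream stages
// can use all cores, not just those holding receiver blocks.
val repartitioned = stream.repartition(25)
repartitioned.foreachRDD(rdd => println(s"partitions = ${rdd.partitions.length}"))

ssc.start()
ssc.awaitTermination()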
Thanks
Best Regards
On Fri, Sep 25, 2015 at 1:44 PM, N B wrote:
> Hello all,
>
> I have a
Hello,
I used a custom receiver in order to receive JMS messages from MQ Servers.
I want to benefit from the YARN cluster; my questions are:
- Is it possible to have only one node receiving JMS messages and parallelize
the RDD over all the cluster nodes?
- Is it possible to also parallelize the
Just use the official connector from DataStax
https://github.com/datastax/spark-cassandra-connector
Your solution is very similar. Let’s assume the state is
case class UserState(amount: Int, updates: Seq[Int])
And your user has 100 - If your user does not see an update, you can emit
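For illustration, a hedged sketch of folding per-user deltas into that UserState with updateStateByKey; the input source, field semantics and checkpoint path are assumptions, not the poster's actual job:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class UserState(amount: Int, updates: Seq[Int])

val ssc = new StreamingContext(
  new SparkConf().setAppName("user-state").setMaster("local[2]"), Seconds(10))
ssc.checkpoint("/tmp/user-state-checkpoint") // required by updateStateByKey

// Placeholder input of "userId,delta" lines; the real source would be Kafka etc.
val deltas = ssc.socketTextStream("localhost", 9999)
  .map(_.split(","))
  .map(a => (a(0), a(1).toInt))

// Fold each batch's deltas into the running per-user state.
val state = deltas.updateStateByKey[UserState] { (newDeltas: Seq[Int], old: Option[UserState]) =>
  val prev = old.getOrElse(UserState(0, Seq.empty))
  Some(UserState(prev.amount + newDeltas.sum, prev.updates ++ newDeltas))
}

state.print()
ssc.start()
ssc.awaitTermination()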
Hi,
My configurations are follows:
SPARK_EXECUTOR_INSTANCES=4
SPARK_EXECUTOR_MEMORY=1G
But on my spark UI it shows:
* *Alive Workers:*1
* *Cores in use:*4 Total, 0 Used
* *Memory in use:*6.7 GB Total, 0.0 B Used
Also while running a program in java for spark I am getting the
following
Try with spark.local.dir in the spark-defaults.conf or SPARK_LOCAL_DIR in
the spark-env.sh file.
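For example, a sketch of setting it programmatically (the path reuses the one mentioned above; note that cluster-manager environment settings such as SPARK_LOCAL_DIRS can override this value):

import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir controls where shuffle and spill scratch files land on each worker.
val conf = new SparkConf()
  .setAppName("local-dir-example")
  .set("spark.local.dir", "/opt/spark-1.2.1/tmp")

val sc = new SparkContext(conf)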
Thanks
Best Regards
On Fri, Sep 25, 2015 at 2:14 PM, mufy wrote:
> Faced with an issue where Spark temp files get filled under
> /opt/spark-1.2.1/tmp on the local filesystem
And the remote machine is not in the same local area network as the cluster.
On Friday, September 25, 2015 12:28 PM, Zhiliang Zhu
wrote:
Hi Zhan,
I have done that following your kind help.
However, I could just use "hadoop fs -ls/-mkdir/-rm XXX" commands
Hi Folks,
I am trying to speed up my Spark Streaming job. I found a presentation by
Tathagata Das that mentions increasing the value of
"spark.streaming.concurrentJobs" if I have more than one output.
In my Spark Streaming job I am reading from Kafka using the receiver-based
approach and transforming
Hi, I have 5 Spark jobs which need to run in parallel to speed up the process;
they take around 6-8 hours together. I have 93 container nodes with 8 cores
each and a total memory capacity of around 2.8 TB. Currently I run each job with around 30
executors with 2 cores and 20 GB each. Each of my jobs processes around 1
Re: DB, I strongly encourage you to look at Cassandra – it’s almost as powerful
as HBase, and a lot easier to set up and manage. Well suited for this type of
use case, with a combination of K/V store and time series data.
For the second question, I’ve used this pattern all the time for “flash
Hi Steve,
Thanks a lot for your reply.
That is, some commands could work on the remote server with the gateway installed, but
some other commands will not work. As expected, the remote machine is not in the
same area network as the cluster, and the cluster's port is forbidden.
While I make the remote
Many thanks Cody, it explains quite a bit.
I had a couple of problems with checkpointing and graceful shutdown moving
from working code in Spark 1.3.0 to 1.5.0: InterruptedExceptions,
KafkaDirectStream couldn't initialize, and some exceptions regarding the WAL even though
I'm using the direct stream. Meanwhile
On 25 Sep 2015, at 03:35, Zhang, Jingyu
> wrote:
I got the following exception when I run
JavaPairRDD.values().saveAsTextFile("s3n://bucket"); can anyone help me out?
Thanks
15/09/25 12:24:32 INFO SparkContext: Successfully stopped
On 25 Sep 2015, at 05:25, Zhiliang Zhu
> wrote:
However, I could just use "hadoop fs -ls/-mkdir/-rm XXX" commands to operate from
the remote machine with the gateway,
which means the namenode is reachable; all those commands only
Thank you Tathagata and Therry for your response. You guys were absolutely
correct: I created a dummy DStream (to prevent the Flume channel from filling
up) and counted the messages but I didn't output (print) them, which is why it
reported that error. Since I called print(), the error is no longer
being
Hi Spark Users,
I am running some Spark jobs every hour. After running for
12 hours the master is getting killed with the exception
*java.lang.OutOfMemoryError: GC overhead limit exceeded*
It looks like there is some memory issue in the Spark master.
I noticed the same kind of issue with
hello,
I am running a Spark application.
I have installed Cloudera Manager.
It includes Spark version 1.2.0.
But now I want to use Spark version 1.4.0.
It is also working fine.
But when I try to access HDFS in Spark 1.4.0 in Eclipse I am getting
the following error.
"Exception in
1) yes, just use .repartition on the inbound stream, this will shuffle data
across your whole cluster and process in parallel as specified.
2) yes, although I’m not sure how to do it for a totally custom receiver. Does
this help as a starting point?