I guess this is not related to SparkR, but to something wrong in Spark Core.
Could you try your application logic within spark-shell (you would have to use the
Scala DataFrame API) instead of the SparkR shell and see if this issue still happens?
-----Original Message-----
From: rporcio [mailto:rpor.
How are you submitting your job? You need to make sure HADOOP_CONF_DIR is
pointing to your Hadoop configuration directory (with the core-site.xml and
hdfs-site.xml files). If you have them set properly, then make sure you are
giving the full HDFS URL, like:
dStream.saveAsTextFiles("hdfs://sigmoid-cluster:9
Make sure your firewall isn't blocking the requests.
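For illustration, a fully qualified call might look like this (the host, port and
output path below are placeholders, not values from this thread):

// Placeholder NameNode host, port and output path; substitute your own.
dStream.saveAsTextFiles("hdfs://namenode-host:9000/user/spark/output/prefix")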
Thanks
Best Regards
On Sat, Oct 24, 2015 at 5:04 PM, Amit Singh Hora
wrote:
> Hi All,
>
> I am trying to write an RDD as a Sequence file into my Hadoop cluster but
> keep getting a connection timeout again and again. I can ping the hadoop cluster
> a
You can use .saveAsObjectFiles("hdfs://sigmoid/twitter/status/") since
you want to store the Status objects. For every batch it will create a
directory under /status (the name will mostly be the timestamp). Since the
data is small (hardly a couple of MBs for a 1-second interval), it will not overwhelm
th
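A rough sketch of the round trip, assuming the DStream[twitter4j.Status] from
TwitterUtils; only the /status prefix comes from this thread, the read-back glob
is illustrative:

// twitterStream stands in for the DStream[twitter4j.Status] created via TwitterUtils.
// Each batch is written to a directory named <prefix>-<batch time in ms>.
twitterStream.saveAsObjectFiles("hdfs://sigmoid/twitter/status/")
// Later, read every saved batch back into a single RDD.
val statuses = sc.objectFile[twitter4j.Status]("hdfs://sigmoid/twitter/status/-*")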
You can do something like this:
val rddQueue = scala.collection.mutable.Queue(rdd1,rdd2,rdd3)
val qDstream = ssc.queueStream(rddQueue)
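Fleshed out into a self-contained sketch (the batch interval and RDD contents are
made up for illustration):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))
// Pre-built RDDs queued as the stream's source; by default one RDD is consumed per batch.
val rddQueue = scala.collection.mutable.Queue(sc.parallelize(1 to 10),
                                              sc.parallelize(11 to 20),
                                              sc.parallelize(21 to 30))
val qDstream = ssc.queueStream(rddQueue)
qDstream.print()
ssc.start()
ssc.awaitTermination()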
Thanks
Best Regards
On Sat, Oct 24, 2015 at 4:43 AM, Anfernee Xu wrote:
> Hi,
>
> Here's my situation, I have some kind of offline dataset, but I want to
You can actually look at this code base
https://github.com/apache/spark/blob/f85aa06464a10f5d1563302fd76465dded475a12/python/pyspark/rdd.py#L1825
The _memory_limit function returns the amount of memory that you set with
spark.python.worker.memory; it is used for groupBy and similar operations.
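For reference, this limit is raised through configuration; a hedged example in
conf/spark-defaults.conf (the value is illustrative, the default is 512m):

spark.python.worker.memory   1g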
Thank
Tom,
Have you set the “MASTER” env variable on your machine? What is its value, if set?
From: Tom Stewart [mailto:stewartthom...@yahoo.com.INVALID]
Sent: Friday, October 30, 2015 10:11 PM
To: user@spark.apache.org
Subject: sparkR 1.5.1 batch yarn-client mode failing on daemon.R not found
I have
Have a look at dynamic resource allocation, described here:
https://spark.apache.org/docs/latest/job-scheduling.html
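A minimal spark-defaults.conf sketch for turning it on (executor counts are
illustrative; the external shuffle service is required for dynamic allocation):

spark.dynamicAllocation.enabled        true
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10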
Thanks
Best Regards
On Thu, Oct 22, 2015 at 11:50 PM, Suman Somasundar <
suman.somasun...@oracle.com> wrote:
> Hi all,
>
>
>
> Is there a way to run 2 spark applications in paralle
Hi,
Do you have anything special in conf/spark-defaults.conf (especially for the
eventLog directory)?
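For comparison, the event-log settings usually look something like this (the
directory below is a placeholder):

spark.eventLog.enabled   true
spark.eventLog.dir       hdfs://namenode:9000/spark-events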
Regards
JB
On 11/02/2015 07:48 AM, Balachandar R.A. wrote:
Can someone tell me at what point this error could occur?
In one of my use cases, I am trying to use a Hadoop custom input format.
H
Can someone tell me at what point this error could occur?
In one of my use cases, I am trying to use a Hadoop custom input format. Here
is my code:
val hConf: Configuration = sc.hadoopConfiguration
hConf.set("fs.hdfs.impl",
classOf[org.apache.hadoop.hdfs.DistributedFileSystem].getName)
hConf.set("fs
Hi,
Any ideas what's going wrong or how to fix it? Do I have to downgrade to
0.9.x to be able to use Spark?
Best regards,
*Sampo Niskanen*
*Lead developer / Wellmo*
sampo.niska...@wellmo.com
+358 40 820 5291
On Fri, Oct 30, 2015 at 4:57 PM, Sampo Niskanen
wrote:
> Hi,
>
> I'm
Hi, Matej,
For the convenience of SparkR users who start SparkR without using
bin/sparkR (for example, in RStudio),
https://issues.apache.org/jira/browse/SPARK-11340 enables setting
“spark.driver.memory” (and other similar options, like
spark.driver.extraClassPath, spark.driver.e
Is your process getting killed? If yes, then try checking with dmesg.
On Mon, Nov 2, 2015 at 8:17 AM, karthik kadiyam <
karthik.kadiyam...@gmail.com> wrote:
> Did anyone have an issue setting the spark.driver.maxResultSize value?
>
> On Friday, October 30, 2015, karthik kadiyam
> wrote:
>
>> Hi Shahid
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMerge join after a "hashpartitioning".
[Hao:] A distributed JOIN operation (either HashBased or SortBased Join)
requ
Hi Spark gurus,
I am evaluating Spark Streaming.
In my application I need to maintain cumulative statistics (e.g. the total
running word count), so I need to call the updateStateByKey function on
every micro-batch.
After setting those things, I got the following behaviors:
* The Processing Time
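A minimal sketch of the cumulative-count pattern described above (the checkpoint
path is a placeholder, and words stands in for the input DStream[String]):

// Stateful operations such as updateStateByKey require a checkpoint directory.
ssc.checkpoint("hdfs://namenode:9000/checkpoints")
val counts = words.map(w => (w, 1)).updateStateByKey[Int] {
  (newValues: Seq[Int], runningTotal: Option[Int]) =>
    // New running total = this batch's count + previous total.
    Some(newValues.sum + runningTotal.getOrElse(0))
}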
Yes, the first code takes only 30 minutes, but with the second method I
waited for 5 hours and it only finished 10%.
Hi,
I'm trying to understand SortMergeJoin (SPARK-2213).
1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For
example, in the code below, the two datasets have different numbers of
partitions, but it still does a SortMerge join after a "hashpartitioning".
CODE:
val sparkCo
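Since the posted code is cut off, an illustrative stand-in (not the poster's code)
that reproduces the described pattern, two datasets with different partition
counts joined on a key, would be:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val left  = sqlContext.range(0, 1000000).repartition(4)
val right = sqlContext.range(0, 1000000).repartition(8)
// With spark.sql.planner.sortMergeJoin=true (the 1.5 default), explain() shows a
// SortMergeJoin preceded by hashpartitioning exchanges on both sides.
left.join(right, "id").explain()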
For constraints like all weights >= 0, people use L-BFGS-B, which is
supported in our optimization library, Breeze.
https://github.com/scalanlp/breeze/issues/323
However, in Spark's LiR, our implementation doesn't support constraints. I do
see this is useful given we're experimenting with
SLIM
Did anyone have an issue setting the spark.driver.maxResultSize value?
On Friday, October 30, 2015, karthik kadiyam
wrote:
> Hi Shahid,
>
> I played around with spark driver memory too. In the conf file it was set
> to " --driver-memory 20G " first. When i changed the spark driver
> maxResultSize from
Hi Koert,
the physical plan looks like it is doing the right thing: it reads the
partitioned table hdfs://user/koert/test, gets the date from the directory
names, hash partitions and aggregates the dates to find the distinct dates, and
finally shuffles the dates for the sort and limit 1 operations.
This is my understanding of the ph
I agree the max date will satisfy the latest-date requirement, but it does
not satisfy the second-latest-date requirement you mentioned.
Just for your information, before you invest in the partitioned table too
much, I want to warn you that it has memory issues (both on the executor and
driver side).
Hi Sparkers.
I am very new to Spark, and I am occasionally getting an RpcTimeoutException
with the following error:
15/11/01 22:19:46 WARN HeartbeatReceiver: Removing executor 0 with no
> recent heartbeats: 321792 ms exceeds timeout 30 ms
> 15/11/01 22:19:46 ERROR TaskSchedulerImpl: Lost executo
I was going for the distinct approach, since I want it to be general enough
to also solve other related problems later. The max-date approach is likely
to be faster, though.
On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam wrote:
> Hi Koert,
>
> You should be able to see if it requires scanning the whole data by
Hi Koert,
You should be able to see whether it requires scanning the whole data by
running "explain" on the query. The physical plan should say something about it. I
wonder if you are trying the distinct-sort-by-limit approach or the
max-date approach?
Best Regards,
Jerry
On Sun, Nov 1, 2015 at 4:25 PM, Koert
It seems pretty fast, but if I have 2 partitions and 10mm records I do have
to dedupe (distinct) 10mm records.
A direct way to just find out what the 2 partitions are would be much
faster. Spark knows it, but it's not exposed.
On Sun, Nov 1, 2015 at 4:08 PM, Koert Kuipers wrote:
> it seems to wor
Good idea. With the dates sorting correctly alphabetically, I should be able
to do something similar with strings.
On Sun, Nov 1, 2015 at 4:06 PM, Jörn Franke wrote:
> Try with max date, in your case it could make more sense to represent the
> date as int
>
> Sent from my iPhone
>
> On 01 Nov 2015
Try with max date; in your case it could make more sense to represent the date
as an int.
Sent from my iPhone
> On 01 Nov 2015, at 21:03, Koert Kuipers wrote:
>
> hello all,
> i am trying to get familiar with spark sql partitioning support.
>
> my data is partitioned by date, so like this:
> dat
Hi Koert,
If the partitioned table is implemented properly, I would think "select
distinct(date) as dt from table order by dt DESC limit 1" would return the
latest date without scanning the whole dataset. I haven't tried it myself. It
would be great if you can report back whether this actually works.
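For reference, the two approaches discussed in this thread might be sketched as
follows ("logs" is a placeholder table name; the date column comes from the
discussion):

// Distinct-then-sort approach.
sqlContext.sql("SELECT DISTINCT date AS dt FROM logs ORDER BY dt DESC LIMIT 1").show()
// Max-date approach: a single aggregate, no global sort.
sqlContext.sql("SELECT MAX(date) AS dt FROM logs").show()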
Hello all,
I am trying to get familiar with Spark SQL partitioning support.
My data is partitioned by date, like this:
data/date=2015-01-01
data/date=2015-01-02
data/date=2015-01-03
...
Let's say I would like a batch process to read data for the latest date
only. How do I proceed?
Generally the
Hi Ted Yu,
Thanks very much for your kind reply. Do you mean that there is no specific
package in Spark for the simplex method?
Then I may try to implement it myself, though I have not yet decided whether
it is convenient to do in Spark.
Thank you,
Zhiliang
On Monday, November 2,
A brief search of the code base shows the following:
TODO: Add simplex constraints to allow alpha in (0,1).
./mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
I guess the answer to your question is no.
FYI
On Sun, Nov 1, 2015 at 9:37 AM, Zhiliang Zhu
wrote:
> Dear All,
>
> As
Dear All,
I am facing a typical linear programming issue, and I know the simplex method
is specifically for solving LP problems. May I ask whether there is already
some mature package in Spark for the simplex method?
Thank you very much! Best wishes, Zhiliang
Dear All,
For N-dimensional linear regression, when the number of labeled training points
(or the rank of the labeled point space) is less than N, then from a mathematical
perspective the weights of the trained linear model may not be unique.
However, the output of model.weight() by Spark may be with so
[adding dev list since it's probably a bug, but i'm not sure how to
reproduce so I can open a bug about it]
Hi,
I have a standalone Spark 1.4.0 cluster with 100s of applications running
every day.
From time to time, the applications crash with the following error (see
below)
But at the same tim
Hi guys,
I have documented the steps involved in getting Spark 1.5.1 to run on CDH
5.4.0 here; let me know if it works for you as well:
https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa?trk=hp-feed-article-title-publish
Looking forward to CDH 5.5 which supports Spark 1.5.x out o
Hi.
What is slow exactly?
In code base 1:
When you run persist() + count(), you store the result in RAM.
Then the map + reduceByKey is done on in-memory data.
In the latter case (all in one line) you are doing both steps at the same
time.
So you are saying that if you sum up the time to
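The two variants being compared might be sketched like this (lines stands in for
the starting RDD; the key/value logic is a placeholder):

// Variant 1: persist and materialize first, then run map + reduceByKey on cached data.
val cached = lines.persist()
cached.count()                                            // forces the RDD into memory
val counts1 = cached.map(l => (l, 1)).reduceByKey(_ + _)

// Variant 2: one lineage, evaluated in a single pass with no caching.
val counts2 = lines.map(l => (l, 1)).reduceByKey(_ + _)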
Hi.
You may want to look into Indexed RDDs
https://github.com/amplab/spark-indexedrdd
Regards,
Gylfi.
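For a one-off query the plain pair-RDD API already has lookup(); IndexedRDD is
aimed at fast repeated point lookups (see its README for the exact API). A small
sketch of the former, with made-up data:

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
// lookup(key) returns every value stored under that key.
val values: Seq[String] = pairs.lookup(2)   // Seq("b")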
Thanks for reporting it, Sjoerd. You might have a different version of
Janino brought in from somewhere else.
This should fix your problem: https://github.com/apache/spark/pull/9372
Can you give it a try?
On Tue, Oct 27, 2015 at 9:12 PM, Sjoerd Mulder
wrote:
> No the job actually doesn't fai