I have been giving it 8-12G
-Raj
Sent from my iPhone
> On Jan 12, 2016, at 6:50 PM, Sabarish Sasidharan
> wrote:
>
> How much RAM are you giving to the driver? 17000 items being collected
> shouldn't fail unless your driver memory is too low.
>
> Regards
>
Hi Richard,
> Would it be possible to access the session API from within ROSE,
> to get for example the images that are generated by R / openCPU
Technically it would be possible although there would be some potentially
significant runtime costs per task in doing so, primarily those related to
Any suggestion/opinion?
On 12-Jan-2016 2:06 pm, "Bharath Ravi Kumar" wrote:
> We're running PCA (selecting 100 principal components) on a dataset that
> has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
> matrix in question is mostly sparse with tens of
Which release of Spark are you using?
Can you turn on DEBUG logging to see if there is more of a clue?
Thanks
On Tue, Jan 12, 2016 at 6:37 PM, AlexG wrote:
> I transpose a matrix (colChunkOfA), stored as a 200-by-54843210 array of
> rows in Array[Array[Float]] format
I have used the master IP address as master_ip, and the Spark conf also has the
IP address, but the following logs show the hostname. (The Spark UI shows the
master details by IP.)
16/01/13 12:31:38 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkDriver@masternode1:36537] has failed,
sowen wrote
> Arrays are not immutable and do not have the equals semantics you need to
> use them as a key. Use a Scala immutable List.
> On Oct 9, 2014 12:32 PM, "Ge, Yao (Y.)"
> yge@
> wrote:
>
>> Yes. I was using String array as arguments in the reduceByKey. I think
>> String array is
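A minimal illustration of the point about array keys (plain Scala plus a hypothetical pair RDD named pairs with made-up values):

// Arrays compare by reference, immutable Lists compare by value.
Array("a", "b") == Array("a", "b")   // false -- unusable as a grouping key
List("a", "b")  == List("a", "b")    // true  -- safe as a grouping key
// So convert the key before reducing (pairs: RDD[(Array[String], Int)] is hypothetical):
pairs.map { case (k, v) => (k.toList, v) }.reduceByKey(_ + _)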
I transpose a matrix (colChunkOfA), stored as a 200-by-54843210 array of
rows in Array[Array[Float]] format, into another matrix (rowChunk), also
stored row-wise as a 54843210-by-200 Array[Array[Float]], using the following
code:
val rowChunk = new Array[Tuple2[Int,Array[Float]]](numCols)
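For context, a rough sketch of how such a row-wise transpose fill loop typically looks (colChunkOfA, numCols and rowChunk are from the snippet above; numRowsInChunk = 200 is an assumed name). Note that rowChunk alone holds about 54,843,210 x 200 x 4 bytes, roughly 44 GB of Float data, on a single JVM:

var j = 0
while (j < numCols) {
  val newRow = new Array[Float](numRowsInChunk)        // 200 in this example
  var i = 0
  while (i < numRowsInChunk) { newRow(i) = colChunkOfA(i)(j); i += 1 }
  rowChunk(j) = (j, newRow)                            // keep the original column index alongside the row
  j += 1
}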
Hi All,
I have used hiveContext.sql() to join a temporary table created from a
DataFrame with parquet tables created in Hive.
The join query runs fine for a few hours and then suddenly fails to do the
join. Once the issue happens, the DataFrame returned from
hiveContext.sql() is empty. If I
I am trying to debug my trained model by exploring theta.
Theta is a Matrix. The Javadoc for Matrix says that it is in column-major
format.
I have trained a NaiveBayesModel. Is the number of classes equal to the number
of rows?
int numRows = nbModel.numClasses();
int numColumns =
You could generate as many duplicates as you need, each with a tag/sequence,
and then use a custom partitioner that uses that tag/sequence in addition to
the key to do the partitioning.
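A rough sketch of that idea with hypothetical names (pairs is the skewed pair RDD; 8 salts and 100 partitions are arbitrary): pair each key with a small random salt and partition on (key, salt) so a hot key is spread across several partitions.

import org.apache.spark.Partitioner
import scala.util.Random

// Spread (key, salt) pairs; a plain HashPartitioner over the tuple would also work.
class SaltedPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (k, salt: Int) =>
      val h = (k.hashCode + salt) % numPartitions
      if (h < 0) h + numPartitions else h
    case other =>
      val h = other.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
  }
}

val salted = pairs.map { case (k, v) => ((k, Random.nextInt(8)), v) }
val spread = salted.partitionBy(new SaltedPartitioner(100))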
Regards
Sab
On 12-Jan-2016 12:21 am, "Daniel Imberman"
wrote:
> Hi all,
>
> I'm looking for a way to
Hi All,
I am running a Spark 1.3.0 standalone cluster. We rebooted the cluster
servers (system reboot). Since then the Spark jobs have been failing with the
following error (each fails within 7-8 seconds). 2 of the jobs are running fine.
All the jobs used to be stable before the system
What's the proper way to write DataSets to disk? Convert them to a
DataFrame and use the writers there?
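If memory serves for 1.6, the Dataset API does not expose a writer directly, so going through the DataFrame writer is the usual route. A minimal sketch, assuming a Dataset[MyRecord] named ds and an arbitrary output path:

ds.toDF().write.parquet("/tmp/my-dataset")
// and to get a typed Dataset back (needs import sqlContext.implicits._ for the encoder):
val restored = sqlContext.read.parquet("/tmp/my-dataset").as[MyRecord]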
I'm using Spark 1.5.1
When I turned on DEBUG, I didn't see anything that looks useful. Other than
the INFO outputs, there are a ton of RPC-message-related logs, and this bit:
16/01/13 05:53:43 DEBUG ClosureCleaner: +++ Cleaning closure
(org.apache.spark.rdd.RDD$$anonfun$count$1) +++
16/01/13
Hello there
I run both the driver and the master on the same node, so I get a "Port already
in use" exception.
Is there any solution to set a different port for each component?
Kyle
2015-12-05 5:54 GMT+08:00 spearson23 :
> Run "spark-submit --help" to see all available options.
>
The code is very simple, pasted below.
hive-site.xml is already in the Spark conf. I still see this error:
Error in writeJobj(con, object) : invalid jobj 3
after running the script below
script
===
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
The complete stacktrace is below. Can it be something with Java versions?
stop("invalid jobj ", value$id)
8
writeJobj(con, object)
7
writeObject(con, a)
6
writeArgs(rc, args)
5
invokeJava(isStatic = TRUE, className, methodName, ...)
4
callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
Does Spark support model deployment through a web/REST API? Is there any other
method for deploying a predictive model in Spark apart from PMML?
Regards,
Chandan
Hello guys
I got a solution.
I set -Dcom.sun.management.jmxremote.port=0 to let the system assign an unused
port.
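For anyone hitting the same collision, a hedged sketch of where such a flag can go (assuming the conflict really is the JMX agent port): the driver can take it via spark.driver.extraJavaOptions, while the standalone master/worker daemons would pick it up from SPARK_DAEMON_JAVA_OPTS.

# spark-defaults.conf (driver side)
spark.driver.extraJavaOptions  -Dcom.sun.management.jmxremote.port=0
# spark-env.sh (master/worker daemons)
export SPARK_DAEMON_JAVA_OPTS="-Dcom.sun.management.jmxremote.port=0"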
Kyle
2016-01-12 16:54 GMT+08:00 Kyle Lin :
> Hello there
>
>
> I run both driver and master on the same node, so I got "Port already in
> use" exception.
>
> Is
Hi all,
we'd like to upgrade one of our Spark jobs from 1.4.1 to 1.5.2 (we run
Spark on Amazon EMR).
The job consumes and pushes lz4 compressed data from/to Kafka.
When upgrading to 1.5.2 everything works fine, except we get the following
exception:
java.lang.NoSuchMethodError:
Thanks for your answer, you are correct, it's just a different approach
than the one I am asking for :)
Building an uber- or assembly jar goes against the idea of placing the
jars on all workers. Uber-jars increase network traffic, whereas using local:/
in the classpath reduces network traffic.
Hi Charles,
I have created very simplified job - https://github.com/ponkin/KafkaSnapshot to
illustrate the problem.
https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala
In short, maybe the persist method is working, but not the way I expected.
I thought
Hi Prabhu, thanks for the response. I did the same; the problem is that when I
get the process id using jps or ps -ef, I don't see the user in the very first
column. I see a number in place of the user name, so I can't run jstack on it
because of a permission issue. It gives something like the following:
728852 3553 9833
We're running PCA (selecting 100 principal components) on a dataset that
has ~29K columns and is 70G in size stored in ~600 parts on HDFS. The
matrix in question is mostly sparse, with tens of columns populated in most
rows, but a few rows with thousands of columns populated. We're running
Spark on
Hi guys,
We have set up dynamic resource allocation on Spark on YARN. We currently use
Spark 1.5.
One executor tries to fetch data from another NodeManager's shuffle
service, and that NodeManager crashes, which leaves the executor stuck in that
state until the crashed NodeManager has been launched again.
It looks like you have overwritten sc. Could you try this:
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)
sc <- sparkR.init()
hivecontext <- sparkRHive.init(sc)
df <- loadDF(hivecontext,
Running this gave
16/01/12 04:06:54 INFO BlockManagerMaster: Registered BlockManager
Error in writeJobj(con, object) : invalid jobj 3
How does it know which hive schema to connect to?
On Tue, Jan 12, 2016 at 2:34 PM, Felix Cheung
wrote:
> It looks like you have
Hello,
I have a 5-node cluster which hosts both HDFS datanodes and Spark workers.
Each node has 8 CPUs and 16G of memory. The Spark version is 1.5.2, and
spark-env.sh is as follows:
export SPARK_MASTER_IP=10.52.39.92
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=4g
Hi,
You can check out http://spark.apache.org/docs/latest/monitoring.html;
you can monitor HDFS, and memory usage per job, executor and driver. I
have connected it to Graphite for storage and Grafana for
visualization. I have also connected it to collectd, which provides me all
the server-node metrics
Ignite can also cache RDDs.
> On 12 Jan 2016, at 13:06, Dmitry Goldenberg wrote:
>
> Jorn, you said Ignite or ... ? What was the second choice you were thinking
> of? It seems that got omitted.
>
>> On Jan 12, 2016, at 2:44 AM, Jörn Franke
Hi
I'm trying to understand how to look up certain id fields of RDDs in an
external mapping table. The table is accessed through a two-way binary
TCP client where an id is provided and an entry returned. Entries cannot
be listed/scanned.
What's the simplest way of managing the tcp client and its
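One common pattern, sketched with hypothetical names (ids is the RDD of keys to look up; LookupClient stands in for the binary TCP client): open one client per partition inside mapPartitions, reuse it for every id in that partition, and close it once the partition is done.

val enriched = ids.mapPartitions { iter =>
  val client = new LookupClient(host, port)   // built on the executor, never serialized
  val looked = iter.map(id => (id, client.lookup(id)))
  val result = looked.toList                  // materialize before closing the connection
  client.close()
  result.iterator
}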
The similarities returned are not in fact true cosine similarities as they
are not properly normalized - this will be fixed in this PR:
https://github.com/apache/spark/pull/10152
On Tue, Dec 15, 2015 at 2:54 AM, jxieeducation
wrote:
> Hi,
>
> For Word2Vec in Mllib, when
It worked for some time. Then I did sparkR.stop() and re-ran it, only to get
the same error. Any idea why it ran fine before? (While it was running fine it
kept warning about reusing the existing Spark context and that I should
restart.) There is one more R script which instantiated Spark; I ran that
again too.
Thanks Michael. Let me test it with a recent master code branch.
Also for every mapping step should I have to create a new case class? I
cannot use Tuple as I have ~130 columns to process. Earlier I had used a
Seq[Any] (actually Array[Any] to optimize on serialization) but processed
it using RDD
Try mapPartitionsWithIndex. Below is an example I used earlier; the myfunc
logic can be further modified as per your need.
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)
// Tag every element with the index of the partition it lives in.
def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.toList.map(x => index + "," + x).iterator
}
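And the call itself, with the expected output (a hedged completion of the example above):

x.mapPartitionsWithIndex(myfunc).collect()
// roughly: Array("0,1", "0,2", "0,3", "1,4", "1,5", "1,6", "2,7", "2,8", "2,9")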
Ran into this need myself. Does Spark have an equivalent of
"mapreduce.input.fileinputformat.list-status.num-threads"?
Thanks.
On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote:
> Hi,
>
> I am wondering if anyone has successfully enabled
>
Hello All -
I'm just trying to understand aggregate() and in the meantime have a
question.
*Is there any way to view the RDD data based on the partition?*
For instance, the following RDD has 2 partitions:
val multi2s = List(2,4,6,8,10,12,14,16,18,20)
val multi2s_RDD =
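One way to see what landed in each partition, as a small sketch built on the snippet above (the sc.parallelize call is an assumed continuation of the truncated line):

val multi2s = List(2,4,6,8,10,12,14,16,18,20)
val multi2s_RDD = sc.parallelize(multi2s, 2)
// glom() turns each partition into an Array, so you can inspect the per-partition contents:
multi2s_RDD.glom().collect()
// Array(Array(2, 4, 6, 8, 10), Array(12, 14, 16, 18, 20))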
Hello,
I am tinkering with Spark 1.6. I have 1.5 billion rows of data, to which I
apply several window functions such as lag, first, etc. The job is quite
expensive; I am running a small cluster with executors running with 70GB of RAM.
Using the new memory management system, the job fails around
Hi,
I wanted to know if there are any implementations yet within the Machine
Learning Library or generally that can efficiently solve eigenvalue problems?
Or, if not, do you have suggestions on how to approach a parallel implementation,
maybe with BLAS or Breeze?
Thanks in advance!
Lydia
Sent from my
>
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
>
> This looks like a bug. What is the error? It might be fixed in
branch-1.6/master if you can test there.
> Please advise on what I may be missing here.
>
>
> Also for join, may I suggest to have a custom encoder / transformation to
>
>export SPARK_WORKER_MEMORY=4g
Maybe you could increase the max heap size on the worker? If the
OutOfMemory is for the driver, then you may want to set memory explicitly
for the driver.
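A hedged example of where those knobs live (the 8g values are arbitrary; on a standalone cluster the executor memory also has to fit inside SPARK_WORKER_MEMORY):

# spark-defaults.conf, or the equivalent --driver-memory / --executor-memory flags on spark-submit
spark.driver.memory    8g
spark.executor.memory  8g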
Thanks,
On Tue, Jan 12, 2016 at 2:04 AM, Barak Yaish wrote:
> Hello,
>
>
Any thoughts on this? I want to know when the window duration is complete,
not the sliding interval. Is there a way I can catch the end of the window
duration, or do I need to keep track of it myself, and how?
LCassa
On Mon, Jan 11, 2016 at 3:09 PM, Cassa L wrote:
> Hi,
> I'm trying to
I think it would be this: https://github.com/onetapbeyond/opencpu-spark-executor
> On 12 Jan 2016, at 18:32, Corey Nolet wrote:
>
> David,
>
> Thank you very much for announcing this! It looks like it could be very
> useful. Would you mind providing a link to the github?
>
I tried to rerun the same code with the current snapshot versions of 1.6 and 2.0
from
https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-core_2.11/
But I still see an exception around the same line. Here is the exception
below. Filed an issue against the same
We have a join of 2 RDDs that was fast (< 1 min), and suddenly it
wouldn't even finish overnight anymore. The change was that the RDD is now
derived from a DataFrame.
So the new code that runs forever is something like this:
dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
any idea
Awesome, thanks for opening the JIRA! We'll take a look.
On Tue, Jan 12, 2016 at 1:53 PM, Muthu Jayakumar wrote:
> I tried to rerun the same code with current snapshot version of 1.6 and
> 2.0 from
>
Hi!
I have some application (skeleton):
val sc = new SparkContext($SOME_CONF)
val input = sc.textFile(inputFile)
val result = input.map(record => {
    val myState = new MyState() // state
  })
  .filter($SOME_FILTER)
  .sortBy($SOME_SORT)
  .partitionBy(new HashPartitioner(100))
I observe that YARN job history logs are created in /user/history/done
(*.jhist files) for all the MapReduce jobs such as Hive, Pig, etc. But for Spark
jobs submitted in yarn-cluster mode, the logs are not being created.
I would like to see resource utilization by Spark jobs. Is there any other
I'd guess that if the resources are broadcast Spark would put them into
Tachyon...
> On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg
> wrote:
>
> Would it make sense to load them into Tachyon and read and broadcast them
> from there since Tachyon is already a part of
No, this is on purpose. Have a look at the build POM. A few Guava classes
were used in the public API for Java and have had to stay unshaded. In 2.x
/ master this is already changed such that no unshaded Guava classes should
be included.
On Tue, Jan 12, 2016, 07:28 Jake Yoon
Jorn, you said Ignite or ... ? What was the second choice you were thinking of?
It seems that got omitted.
> On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote:
>
> You can look at ignite as a HDFS cache or for storing rdds.
>
>> On 11 Jan 2016, at 21:14, Dmitry Goldenberg
On EMR, you can add fs.* params in emrfs-site.xml.
On Tue, Jan 12, 2016 at 7:27 AM, Jonathan Kelly
wrote:
> Yes, IAM roles are actually required now for EMR. If you use Spark on EMR
> (vs. just EC2), you get S3 configuration for free (it goes by the name
> EMRFS), and it
2 cents:
1. You should use an environment management tool, such as Ansible, Puppet
or Chef, to handle this kind of use case (and a lot more, e.g. what if you want
to add more nodes or to replace one bad node).
2. There are options such as --py-files to provide a zip file; see the sketch below.
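A hedged example of that second option (file names are hypothetical):

spark-submit --master yarn --py-files mylibs.zip my_job.py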
On Tue, Jan 12, 2016 at 6:11
Hi,
Are there any plans to implement the "Top K Parallel FPGrowth" algorithm in
Spark, that aggregates the frequent items together (similar to the mahout
implementation).
Thanks
John
Would it make sense to load them into Tachyon and read and broadcast them from
there since Tachyon is already a part of the Spark stack?
If so I wonder if I could do that Tachyon read/write via a Spark API?
> On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan
>
Thanks, Gene.
Does Spark use Tachyon under the covers anyway for implementing its
"cluster memory" support?
It seems that the practice I hear the most about is the idea of loading
resources as RDDs and then doing joins against them to achieve the lookup
effect.
The other approach would be to
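A minimal sketch of that join-based lookup with hypothetical names (events is an RDD keyed by id; parseLine and enrich are placeholders):

val lookup = sc.textFile("/path/to/resource").map(parseLine)   // yields (id, meta) pairs
val byId   = events.map(e => (e.id, e))
val joined = byId.join(lookup).map { case (_, (event, meta)) => enrich(event, meta) }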
There can be dataloss when you are using the DirectOutputCommitter and
speculation is turned on, so we disable it automatically.
On Tue, Jan 12, 2016 at 1:11 PM, Jerry Lam wrote:
> Hi spark users and developers,
>
> I wonder if the following observed behaviour is expected.
Hi,
I see strange behavior when creating a standalone Spark container using
Docker.
Not sure why, but by default it assigns 4 cores to the first job submitted,
and then all the other jobs are in a wait state. Please suggest if there is
a setting to change this.
I tried --executor-cores 1 but
Hi spark users and developers,
I wonder if the following observed behaviour is expected. I'm writing a
dataframe to Parquet in S3. I'm using append mode when writing to it.
Since I'm using org.apache.spark.sql.parquet.DirectParquetOutputCommitter as
the
Hello Prem -
Thanks for sharing and I also found the similar example from the link
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate
But I am trying to understand the actual functionality or behavior.
Thanks & Regards,
Gokula Krishnan (Gokul)
On Tue, Jan 12, 2016 at
I explored these examples a couple of months back; it is a very good link for
RDD operations. See if the explanation below helps; try to understand the
difference between the following 2 examples. The initial value in both is "".
Example 1:
val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) =>
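A hedged completion of the two examples, assuming the min-of-lengths seqOp from that page (exact result strings can swap order because partition results are combined in no guaranteed order):

// Example 1: the "" element sits in the middle of the list.
val z1 = sc.parallelize(List("12","23","","345"), 2)
z1.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y)
// "11": each partition's chain hits a zero-length string, turns it into "0",
//       and the next min then sees length 1 -> "1"; combOp concatenates "1" + "1".

// Example 2: the "" element is last.
val z2 = sc.parallelize(List("12","23","345",""), 2)
z2.aggregate("")((x, y) => math.min(x.length, y.length).toString, (x, y) => x + y)
// "10" (or "01"): the trailing "" drags the second partition's result down to "0".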
By the way, this issue happens only with classes needed for the InputFormat. If
the input format is org.apache.hadoop.mapred.TextInputFormat and the SerDe
is from an additional jar, it works just fine.
I don't want to upgrade CDH for this. Also, if it should work on CDH 5.5, why
is that? What patch fixes
I also had a quick look and agree it's not very clear. I believe if one reads
through the clustering logic and the replication settings, one gets a good idea
of how it works.
https://apacheignite.readme.io/docs/cluster
I believe it integrates with Hadoop and other file based systems for
Hi all,
I'd like to share news of the recent release of a new Spark package,
[ROSE](http://spark-packages.org/package/onetapbeyond/opencpu-spark-executor).
ROSE is a Scala library offering access to the full scientific computing power
of the R programming language to Apache Spark batch and
I call stop from the console as RStudio warns and advises it. And yes, after
stop was called the whole script was run again from the top, which means the
init "hivecontext <- sparkRHive.init(sc)" is always called after stop.
On Tue, Jan 12, 2016 at 8:31 PM, Felix Cheung
wrote:
>
As you can see from my reply below from Jan 6, calling sparkR.stop()
invalidates both the sc and the hivecontext you have, and results in this
invalid jobj error.
If you start R and run this, it should work:
Sys.setenv(SPARK_HOME="/usr/hdp/current/spark-client")
Thanks! Setting java.io.tmpdir did the trick. Sadly, I still ran into an
issue with the amount of RAM pyspark was grabbing. In fact I got a message
from my web provider warning that I was exceeding the memory limit for my
(entry level) account. So I won't be pursuing it further. Oh well, it
As I understand it, your initial number of partitions will always depend on
the initial data. I'm not aware of any way to change this, other than
changing the configuration of the underlying data store.
Have you tried reading the data in several data frames (e.g. one data frame
per day),
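A rough sketch of that idea with hypothetical paths (one DataFrame per day, unioned, then repartitioned to whatever parallelism you want, independent of the source layout):

val days = Seq("2016-01-10", "2016-01-11", "2016-01-12")
val df = days.map(d => sqlContext.read.parquet(s"/data/events/day=$d"))
             .reduce(_ unionAll _)
             .repartition(200)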
It's Spark 1.5.1.
The dataframe has simply 2 columns, both strings.
A SQL query would probably be more efficient, but that doesn't fit our purpose
(we are doing a lot more stuff where we need RDDs).
Also, I am just trying to understand in general what in that RDD coming from
a dataframe could slow things
Can you please provide the high-level schema of the entities that you are
attempting to join? I think that you may be able to use a more efficient
technique to join these together; perhaps by registering the Dataframes as
temp tables and constructing a Spark SQL query.
Also, which version of
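For reference, a minimal sketch of the temp-table approach (dfLeft, dfRight and the column names are hypothetical):

dfLeft.registerTempTable("left_t")
dfRight.registerTempTable("right_t")
val joined = sqlContext.sql(
  "SELECT l.*, r.some_col FROM left_t l JOIN right_t r ON l.id = r.id")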
I see. So there are actually 3000 tasks instead of 3000 jobs, right?
Would you mind providing the full stack trace of the GC issue? At first
I thought it was identical to the _metadata one in the mail thread you
mentioned.
Cheng
On 1/11/16 5:30 PM, Gavin Yue wrote:
Here is how I set the conf:
Hi, I'm moving my infrastructure from 1.5.2 to 1.6.0 and experiencing a
serious issue. I successfully updated the Spark thrift server from 1.5.2 to
1.6.0. But I have a standalone application which worked fine with 1.5.2 and is
failing on 1.6.0 with:
*NestedThrowables:*
*java.lang.ClassNotFoundException:
Hi Michael,
Thanks for the hint! So if I turn off speculation, consecutive appends like
above will not produce temporary files right?
Which class is responsible for disabling the use of DirectOutputCommitter?
Thank you,
Jerry
On Tue, Jan 12, 2016 at 4:12 PM, Michael Armbrust
Alex, see this jira-
https://issues.apache.org/jira/browse/SPARK-9926
On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky <
alex.nastet...@vervemobile.com> wrote:
> Ran into this need myself. Does Spark have an equivalent of
> "mapreduce.input.fileinputformat.list-status.num-threads"?
>
> Thanks.
Hi,
this looks great and seems to be very usable.
Would it be possible to access the session API from within ROSE, to get for
example the images that are generated by R / openCPU and the logging to
stdout that is logged by R?
thanks in advance,
Richard
On Tue, Jan 12, 2016 at 10:16 PM, Vijay
Hi,
Is there a way to read a text file from inside a Spark executor? I need to
do this for a streaming application where we need to read a file (whose
contents may change) from a closure.
I cannot use the "sc.textFile" method since the Spark context is not
serializable. I also cannot read a file
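One hedged sketch of a workaround: read the file lazily on the executors inside mapPartitions, so no SparkContext is captured in the closure (records, applyRules and the path are hypothetical; the file must be reachable from every executor, e.g. on a shared mount or shipped with --files):

import scala.io.Source
val enriched = records.mapPartitions { iter =>
  val rules = Source.fromFile("/shared/path/rules.txt").getLines().toVector  // re-read each time the partition runs
  iter.map(record => applyRules(record, rules))
}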
Folks: We are running into a problem where FPGrowth seems to choke on data sets
that we think are not too large. We have about 200,000 transactions. Each
transaction is composed of, on average, 50 items. There are about 17,000
unique items (SKUs) that might show up in any transaction.
When
Thanks. I was actually able to get
mapreduce.input.fileinputformat.list-status.num-threads working in Spark
against a regular fileset in S3, in Spark 1.5.2 ... looks like the issue is
isolated to Hive.
On Tue, Jan 12, 2016 at 6:48 PM, Cheolsoo Park wrote:
> Alex, see this
Hi Dmitry,
Yes, Tachyon can help with your use case. You can read and write to Tachyon
via the filesystem api (
http://tachyon-project.org/documentation/File-System-API.html). There is a
native Java API as well as a Hadoop-compatible API. Spark is also able to
interact with Tachyon via the
Hi, I have the following code which I run as part of a thread that becomes a
child job of my main Spark job. It takes hours to run for large data (around
1-2GB) because of coalesce(1); if the data is in MB/KB then it finishes faster.
With larger data set sizes it sometimes does not complete at all. Please
David,
Thank you very much for announcing this! It looks like it could be very
useful. Would you mind providing a link to the github?
On Tue, Jan 12, 2016 at 10:03 AM, David
wrote:
> Hi all,
>
> I'd like to share news of the recent release of a new Spark
Hi Corey,
> Would you mind providing a link to the github?
Sure, here is the github link you're looking for:
https://github.com/onetapbeyond/opencpu-spark-executor
David
"All that is gold does not glitter, Not all those who wander are lost."
Original Message
Subject: Re:
Definitely great news for all the R and Spark folks over here.
From: Corey Nolet [mailto:cjno...@gmail.com]
Sent: Tuesday, January 12, 2016 11:02 PM
To: David
Cc: user@spark.apache.org; d...@spark.apache.org
Subject: Re: ROSE: Spark + R on the JVM.
David,
Thank you very much for announcing