I've isolated this to a memory issue but I don't know which parameter I need
to tweak. If I sample my samples RDD with 35% of the data, everything runs
to completion; with a larger fraction, it fails. In standalone mode, I can run on the
full RDD without any problems.
// works
val samples =
Thanks. This was already helping a bit. But the examples don't use custom
InputFormats; rather, they use a fully qualified org.apache InputFormat. If I want to
use my own custom InputFormat in the form of a .class (or jar), how can I use it? I
tried providing it to pyspark with --jars myCustomInputFormat.jar
and
Try many combinations of parameters on a small dataset, find the best,
and then try to map them to a big dataset. You can also reduce the
search region iteratively based on the best combination in the current
iteration. -Xiangrui
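A minimal sketch of that workflow, assuming an RDD[LabeledPoint] named data and using SVMWithSGD purely as an example model (parameter ranges are made up):
~~~
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Work on a small sample first, as suggested above.
val Array(train, valid) =
  data.sample(false, 0.1, 42L).randomSplit(Array(0.8, 0.2), seed = 11L)
train.cache(); valid.cache()

val stepSizes = Seq(0.1, 1.0)
val regParams = Seq(0.01, 0.1, 1.0)

// Score every combination on the sample and keep the best one.
val scored = for (step <- stepSizes; reg <- regParams) yield {
  val model = SVMWithSGD.train(train, 100, step, reg, 1.0)
  val scoreAndLabel = valid.map(p => (model.predict(p.features), p.label))
  ((step, reg), new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC())
}
val best = scored.maxBy(_._2)
// Next round: narrow stepSizes/regParams around best._1, then train on the full data.
~~~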
On Wed, Aug 13, 2014 at 1:13 AM, Hoai-Thu Vuong thuv...@gmail.com
Could you try mapping it to row-major order first? Your approach may
generate multiple copies of the data. The code should look like this:
~~~
val rows = rdd.flatMap { case (j, values) =>
  values.view.zipWithIndex.map { case (v, i) =>
    (i, (j, v))
  }
}.groupByKey().map { case (i, entries) =>
  // assemble row i from its (columnIndex, value) entries
  (i, entries.toSeq.sortBy(_._1).map(_._2))
}
~~~
bump. same problem here.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Job-aborted-due-to-stage-failure-TID-x-failed-for-unknown-reasons-tp10187p12095.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hello,
I wrote a class named BooleanPair:
public static class BooleanPair implements Serializable {
    public Boolean elementBool1;
    public Boolean elementBool2;
    BooleanPair(Boolean bool1, Boolean bool2) { elementBool1 = bool1; elementBool2 = bool2; }
    public String
FlatMap the JavaRDD<BooleanPair[]> to JavaRDD<BooleanPair>. Then it should
work.
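For illustration, a minimal Scala sketch of the same flattening (the Java API's flatMap with a FlatMapFunction works analogously); this BooleanPair is a stand-in for the class above:
~~~
case class BooleanPair(elementBool1: Boolean, elementBool2: Boolean)

val pairArrays = sc.parallelize(Seq(
  Array(BooleanPair(true, false), BooleanPair(false, false)),
  Array(BooleanPair(true, true))
))
// Flatten RDD[Array[BooleanPair]] into RDD[BooleanPair].
val pairs = pairArrays.flatMap(arr => arr)
pairs.count() // 3
~~~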
TD
On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li gefeili.2...@gmail.com wrote:
Hello,
I wrote a class named BooleanPair:
public static class BooleanPair implements Serializable {
public Boolean
Hi,
I am running Spark directly from git. I recently compiled the newer
Aug 13 version and it has a performance drop of 2-3x in reads from
HDFS compared to the git version from Aug 1. So I am wondering which commit
could have caused such an issue in read performance. The performance is
almost
Thank you! It works so well for me!
Regards,
Gefei
On Thu, Aug 14, 2014 at 4:25 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
FlatMap the JavaRDD<BooleanPair[]> to JavaRDD<BooleanPair>. Then it should
work.
TD
On Thu, Aug 14, 2014 at 1:23 AM, Gefei Li gefeili.2...@gmail.com wrote:
Oh, right, I meant within SQLContext alone: a SchemaRDD created from a text file with a
case class.
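For reference, a sketch of that route as it appears in the Spark SQL guide of this era (file path and schema are examples only):
~~~
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD // implicit conversion from RDD to SchemaRDD

case class Person(name: String, age: Int)

// Build a SchemaRDD from a plain text file via the case class.
val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))

people.registerAsTable("people") // Spark 1.0.x API; later renamed registerTempTable
val teens = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
~~~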
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tables-tp11851p12100.html
Sent from the Apache Spark User List mailing list
Someone in this community gave me a video:
https://www.youtube.com/watch?v=sPhyePwo7FA. I had the same question in
this community and others helped me solve the problem. I'm trying
to load a MatrixFactorizationModel from an object file, but the compiler said that
I cannot create the object because the
It is interesting to save an RDD to disk or HDFS or somewhere else as a
set of objects, but I think it's more useful to save it as a text file for
debugging or just as an output file. If we want to reuse an RDD, a text file
also works, but perhaps a set of object files will bring a decrease in
Did you check out http://www.spark-stack.org/spark-cluster-on-google-compute/
already?
Cheers,
Michael
--
Michael Hausenblas
Ireland, Europe
http://mhausenblas.info/
On 14 Aug 2014, at 05:17, Soumya Simanta soumya.sima...@gmail.com wrote:
Before I start doing something on
I started up a cluster on EC2 (using the provided scripts) and specified a
different instance type for the master and the worker nodes. The cluster
started fine, but when I looked at the cluster (via port 8080), it showed that
the amount of memory available to the worker nodes did not
Hi Darin,
This is the piece of code
https://github.com/mesos/spark-ec2/blob/v3/deploy_templates.py doing the
actual work (setting the memory). As you can see, it leaves 15 GB of RAM for the
OS on a 100 GB machine... 2 GB of RAM on a 10-20 GB machine, etc.
You can always set
I have tried that already but still get the same error.
To be honest, I feel as though I am missing something obvious in my
configuration; I just can't find what it may be.
Miroslaw Horbal
On Wed, Aug 13, 2014 at 10:38 PM, jerryye [via Apache Spark User List]
Hi Hoai-Thu, the issue of a private default constructor is unlikely to be the cause
here, since Lance was already able to load/deserialize the model object.
And on that side topic, I wish all serde libraries would just use
constructor.setAccessible(true) by default :-) Most of the time that
privacy is
I’d suggest something like Apache YARN, or Apache Mesos with Marathon or
something similar to allow for management, in particular restart on failure.
mn
On Aug 13, 2014, at 7:15 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Thu, Aug 14, 2014 at 5:49 AM, salemi alireza.sal...@udo.edu
What about down-scaling when I use Mesos? Does that really degrade
performance? Otherwise we would probably go for Spark on Mesos on EC2 :)
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Down-scaling-Spark-on-EC2-cluster-tp10494p12109.html
Sent
As we know, in Spark, SparkContext provides the wholeTextFiles() method to read
all files in a specific directory and generate an RDD of (fileName, content) pairs:
scala> val lines = sc.wholeTextFiles("/Users/workspace/scala101/data")
14/08/14 22:43:02 INFO MemoryStore: ensureFreeSpace(35896) called with
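A small follow-up sketch of working with the resulting (fileName, content) pairs (the directory is just an example):
~~~
val files = sc.wholeTextFiles("/Users/workspace/scala101/data") // RDD[(String, String)]

// For example, count the lines in each file.
val linesPerFile = files.mapValues(_.split("\n").length)
linesPerFile.collect().foreach { case (name, n) => println(s"$name: $n lines") }
~~~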
I think I can reproduce this error.
The following code does not work and reports that Foo cannot be serialized (log
in gist https://gist.github.com/zsxwing/4f9f17201d4378fe3e16):
class Foo { def foo() = Array(1.0) }
val t = new Foo
val m = t.foo
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))
r2.toArray
You also need to ensure you're using checkpointing and support recreating the
context on driver failure as described in the docs here:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node
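A minimal sketch of that checkpoint-and-recreate pattern from the guide (checkpoint directory, app name, and batch interval are placeholders):
~~~
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("resilient-stream")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // define the DStream transformations here
  ssc
}

// Recover from the checkpoint if one exists, otherwise build a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
~~~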
From: Matt Narrell
The following code works, too:
class Foo1 extends Serializable { def foo() = Array(1.0) }
val t1 = new Foo1
val m1 = t1.foo
val r11 = sc.parallelize(List(1, 2, 3))
val r22 = r11.map(_ + m1(0))
r22.toArray
On Thu, Aug 14, 2014 at 10:55 PM, Shixiong Zhu [via Apache Spark User List]
I think in the following case
class Foo { def foo() = Array(1.0) }
val t = new Foo
val m = t.foo
val r1 = sc.parallelize(List(1, 2, 3))
val r2 = r1.map(_ + m(0))
r2.toArray
Spark should not serialize t, but it looks like it will.
Best Regards,
Shixiong Zhu
2014-08-14 23:22 GMT+08:00 lancezhange
Good timing! I encountered that same issue recently and to address it, I
changed the default Class.forName call to Utils.classForName. See my patch
at https://github.com/apache/spark/pull/1916. After that change, my
bin/pyspark --jars worked.
On Wed, Aug 13, 2014 at 11:47 PM, Tassilo Klein
Hi All,
I have a Spark job for which I need to increase the amount of memory
allocated to the driver to collect a large-ish (200M) data structure.
Formerly, I accomplished this by setting SPARK_MEM before invoking my
job (which effectively set memory on the driver) and then setting
You can try something like this,
val kvRdd = sc.textFile("rawdata/").map( m => {
  val pfUser = m.split("\t", 2)
  (pfUser(0) -> pfUser(1))
})
Hi Yanbo, I think it was happening because some of the rows did not have all the
columns. We are cleaning up the data and will let you know once we confirm this.
Date: Thu, 14 Aug 2014 22:50:58 +0800
Subject: Re: java.lang.UnknownError: no bin was found for continuous variable.
From:
I tried a simple spark-hive select and insert, and it works. But to directly
manipulate ORC files through RDDs, Spark has to be upgraded to support
Hive 0.13 first, because some of the ORC API is not exposed until Hive 0.13.
Thanks.
Zhan Zhang
On Aug 11, 2014, at 10:23 PM,
I have an MLlib model:
val model = DecisionTree.train(parsedData, Regression, Variance, maxDepth)
I see the model has the following methods: algo, asInstanceOf, isInstanceOf,
predict, toString, topNode
model.topNode outputs: org.apache.spark.mllib.tree.model.Node = id = 0, isLeaf =
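If the question is how to use the model, a brief sketch of the predict methods (the feature values are made up):
~~~
import org.apache.spark.mllib.linalg.Vectors

// Predict for a single feature vector...
val yHat = model.predict(Vectors.dense(1.0, 0.5, 3.2))

// ...or for an RDD of feature vectors.
val predictions = model.predict(parsedData.map(_.features))
~~~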
Hi All,
I'm having some trouble setting the disk spill directory for Spark. The
following approaches set spark.local.dir (according to the Environment
tab of the web UI) but produce the indicated warnings:
*In spark-env.sh:*
export SPARK_JAVA_OPTS=-Dspark.local.dir=/spark/spill
*Associated
Actually I faced it yesterday...
I had to put it in spark-env.sh and take it out of spark-defaults.conf on
1.0.1... Note that this setting should be visible on all workers.
After that I validated that SPARK_LOCAL_DIRS was indeed being used for
shuffling...
On Thu, Aug 14, 2014 at 10:27
Yes, thanks, great. This seems to be the issue.
At least running with spark-submit works as well.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-Hadoop-InputFormat-in-Python-tp12067p12126.html
Sent from the Apache Spark User List mailing list archive
Thanks, will give that a try.
I see the number of partitions requested is 8 (through HashPartitioner(8)).
If I have a 40-node cluster, what's the recommended number of partitions?
--
View this message in context:
I've created an issue to track this: SPARK-3044: Create RSS feed for Spark
News https://issues.apache.org/jira/browse/SPARK-3044
On Fri, May 30, 2014 at 11:07 AM, Nick Chammas nicholas.cham...@gmail.com
wrote:
Is there a way to subscribe to news releases
Yes, you are right, but I tried the old hadoopFile with OrcInputFormat. In Hive 0.12,
OrcStruct does not expose its API, so Spark cannot access it. With Hive 0.13, an RDD
can read from an ORC file. By the way, I didn't see ORCNewOutputFormat in Hive 0.13.
Direct RDD manipulation (Hive 0.13):
val inputRead =
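As a rough, unverified sketch of what such a read might look like with Hive 0.13 on the classpath (the path is a placeholder; accessing columns inside OrcStruct still depends on which API is exposed):
~~~
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat, OrcStruct}
import org.apache.hadoop.io.NullWritable

// Read an ORC file as (NullWritable, OrcStruct) pairs via the old mapred InputFormat.
val orcRows = sc.hadoopFile(
  "hdfs:///warehouse/mytable/part-00000", // placeholder path
  classOf[OrcInputFormat],
  classOf[NullWritable],
  classOf[OrcStruct])

println(orcRows.count())
~~~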
Hi there,
I have several large files (500GB per file) to transform into Parquet format
and write to HDFS. The problems I encountered can be described as follows:
1) At first, I tried to load all the records in a file and then used
sc.parallelize(data) to generate an RDD, and finally used
First, I think you might have a misconception about partitioning. ALL RDDs
are partitioned (even if they are a single partition). When reading from
HDFS the number of partitions depends on how the data is stored in HDFS.
After data is shuffled (generally caused by things like reduceByKey), the
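A quick illustration of that behavior (paths and operations are examples):
~~~
val lines = sc.textFile("hdfs:///data/input")
println(lines.partitions.size) // roughly one partition per HDFS block/split

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
println(counts.partitions.size) // set by the shuffle's partitioner / default parallelism
~~~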
Hi,
Does anyone have specific documentation for integrating Spark with a Hadoop
distribution that does not already have Spark?
Thanks,
Abhilash
Hi Davies,
I tried the second option and launched my EC2 cluster with the master branch on all
the slaves by providing the latest commit hash of master as the
--spark-version option to the spark-ec2 script. However, I am getting the
same errors as before. I am running the job with the original
The errors are occurring at the exact same point in the job as
well... right at the end of the groupByKey() when 5 tasks are left.
On Thu, Aug 14, 2014 at 12:59 PM, Arpan Ghosh ar...@automatic.com wrote:
Hi Davies,
I tried the second option and launched my ec2 cluster with master on all
Thanks Daniel for the detailed information. Since the RDD is already
partitioned, there is no need to worry about repartitioning.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Ways-to-partition-the-RDD-tp12083p12136.html
Sent from the Apache Spark User
I agree. We need support similar to Parquet files for end users. That's the
purpose of SPARK-2883.
Thanks.
Zhan Zhang
On Aug 14, 2014, at 11:42 AM, Yin Huai huaiyin@gmail.com wrote:
I feel that using hadoopFile and saveAsHadoopFile to read and write ORC files
is more towards
Hi,
I was reading the documentation at http://hortonworks.com/labs/spark/
and it seems to say that Spark is not ready for enterprise, which I
think is not quite right. What I think they wanted to say is Spark on
HDP is not ready for enterprise. I was wondering if someone here is
using Spark on
I have run into that issue too, but only when the data were not
pre-processed correctly. E.g., if a categorical feature is binary with
values in {-1, +1} instead of {0,1}. Will be very interested to learn if
it can occur elsewhere!
On Thu, Aug 14, 2014 at 10:16 AM, Sameer Tilak
The reason we are not using MLlib and Breeze is the lack of control over the
data and performance. After computing the covariance matrix, there isn't too
much we can do with it. Many of the methods are private. For now, we need
the max value and the corresponding pair of columns. Later, we may
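For that last point, a minimal local sketch, assuming the covariance matrix is held as an Array[Array[Double]] (adapt to however it is actually stored):
~~~
// Find the largest off-diagonal covariance and the pair of columns it belongs to.
def maxOffDiagonal(cov: Array[Array[Double]]): (Double, Int, Int) = {
  var best = (Double.NegativeInfinity, -1, -1)
  for (i <- cov.indices; j <- cov(i).indices if i != j && cov(i)(j) > best._1)
    best = (cov(i)(j), i, j)
  best
}
~~~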
For those who were not able to attend the Seattle Spark Meetup - Spark at eBay
- Troubleshooting the Everyday Issues, the slides have now been posted at:
http://files.meetup.com/12063092/SparkMeetupAugust2014Public.pdf.
Enjoy!
Denny
Hi,
How would you implement the batch layer of the lambda architecture with
Spark/Spark Streaming?
Thanks,
Ali
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-lamda-architecture-tp12142.html
Sent from the Apache Spark User List mailing list
Hi,
I am using Spark 1.0.1. But I am still not able to see the stats for
completed apps on port 4040 - only for running apps. Is this feature
supported or is there a way to log this info to some file? I am interested
in stats about the total # of executors, total runtime, and total memory
used by
Hi,
For our large ALS runs, we are considering using sc.setCheckpointDir so
that the intermediate factors are written to HDFS and the lineage is
broken...
Is there a comparison which shows the performance degradation due to these
options? If not, I will be happy to add experiments with it...
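For reference, a sketch of where setCheckpointDir fits in (the HDFS path and ALS parameters are placeholders; this does not answer the performance question):
~~~
import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // placeholder HDFS path

// ratings: RDD[Rating] built elsewhere; rank = 20, iterations = 20, lambda = 0.01 are examples.
val model = ALS.train(ratings, 20, 20, 0.01)
~~~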
We have our quantitative team using Spark as part of their daily work. One
of the more common problems we run into is that people unintentionally
leave their shells open throughout the day. This eats up memory in the
cluster and causes others to have limited resources to run their jobs.
With
Hi, I'm having trouble compiling a snapshot, any advice would be
appreciated. I get the error below when compiling either master or
branch-1.1. The key error is, I believe, [ERROR] File name too long
but I don't understand what it is referring to. Thanks!
./make-distribution.sh --tgz
There may be cases where you want to adjust the number of partitions or
explicitly call RDD.repartition or RDD.coalesce. However, I would start
with the defaults and then adjust if necessary to improve performance (for
example, if cores are idling because there aren't enough tasks you may want
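A small sketch of those two calls (the numbers are arbitrary):
~~~
val rdd = sc.textFile("hdfs:///data/input")

// Increase parallelism (triggers a full shuffle).
val wider = rdd.repartition(200)

// Reduce the number of partitions without a full shuffle, e.g. before writing output.
val narrower = wider.coalesce(20)
~~~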
If I don't misunderstand you, setting event logging in SPARK_JAVA_OPTS
should achieve what you want. I'm logging to HDFS, but according to the
config page http://spark.apache.org/docs/latest/configuration.html a
folder should be possible as well.
Example with all other settings
Hi all,
As Simon explained, you need to set spark.eventLog.enabled to true.
I'd like to add that the usage of SPARK_JAVA_OPTS to set spark
configurations is deprecated. I'm sure many of you have noticed this from
the scary warning message we print out. :) The recommended and supported
way of
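As an illustration, the same settings can live in conf/spark-defaults.conf or be set programmatically; a sketch of the SparkConf form (the log directory is a placeholder):
~~~
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("event-logged-app")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs:///user/spark/applicationHistory") // placeholder

val sc = new SparkContext(conf)
// ... run the job ...
sc.stop() // the event log is finalized when the context stops
~~~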
I finally solved the problem by following code
var m: org.apache.spark.mllib.classification.LogisticRegressionModel = null
m = newModel // newModel is the loaded one, see above post of mine
val labelsAndPredsOnGoodData = goodDataPoints.map { point =>
val prediction =
Hi,
I have launched an AWS Spark cluster using the spark-ec2 script
(--hadoop-major-version=1). The ephemeral HDFS is set up correctly and I can
see the name node at master hostname:50070. When I try to copy files from
S3 into ephemeral HDFS using distcp with the following command:
I set spark.eventLog.enabled to true in
$SPARK_HOME/conf/spark-defaults.conf and also configured logging to a
file as well as the console in log4j.properties. But I am not able to get the
statistics logged to a file. On the console there are a lot of log
messages along with the stats - so
Hi all, I'm trying to change the defaults for where data gets written.
I've set -Dspark.local.dir=/spark/tmp and I can see that the setting is
used when the executor is started.
I do indeed see directories like spark-local-20140815004454-bb3f in this
desired location but I also see undesired stuff under
Can you be a bit more specific about what you mean by lambda architecture?
On Thu, Aug 14, 2014 at 2:27 PM, salemi alireza.sal...@udo.edu wrote:
Hi,
How would you implement the batch layer of the lambda architecture with
Spark/Spark Streaming?
Thanks,
Ali
--
View this message in context:
Could you try increasing the number of slices with the large data set?
SparkR assumes that each slice (or partition, in Spark terminology) can fit
in the memory of a single machine. Also, is the error happening when you do the
map function, or does it happen when you combine the results?
Thanks
Hi SK,
Not sure if I understand you correctly, but here is how the user normally
uses the event logging functionality:
After setting spark.eventLog.enabled and optionally spark.eventLog.dir,
the user runs his/her Spark application and calls sc.stop() at the end of
it. Then he/she goes to the
Below is what I understand by the lambda architecture. The batch layer
provides the historical data and the speed layer provides the real-time
view!
All data entering the system is dispatched to both the batch layer and the
speed layer for processing.
The batch layer has two functions:
I've had this issue too running Spark 1.0.0 on YARN with HDFS: it
defaults to a working directory located in hdfs:///user/$USERNAME and
it's not clear how to set the working directory.
In the case where HDFS has a non-standard directory structure (i.e.,
home directories located in hdfs:///users/)
How would you implement the batch layer of the lambda architecture with
Spark/Spark Streaming?
I assume you’re familiar with resources such as
https://speakerdeck.com/mhausenblas/lambda-architecture-with-apache-spark and
are after more detailed advice?
Cheers,
Michael
--
Hi Ali,
Maybe you can take a look at Twitter's Summingbird project
(https://github.com/twitter/summingbird), which is currently one of the few
open-source choices for a lambda architecture. There's an ongoing sub-project
called summingbird-spark that might be the one you want; maybe this can
Hi Guys,
I have a serious problem regarding 'None' in RDDs (pyspark).
Take an example of a transformation that produces 'None':
leftOuterJoin(self, other, numPartitions=None)
Perform a left outer join of self and other. For (K, V) and (K, W), this returns a
dataset of (K, (V, W)) pairs with all
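Not a pyspark answer, but the Scala analogue may help illustrate the shape of the result: the missing right side comes back as an absent value that has to be handled (or filtered) explicitly, which is what pyspark surfaces as None:
~~~
val left  = sc.parallelize(Seq(("a", 1), ("b", 2)))
val right = sc.parallelize(Seq(("a", 10)))

// Scala: (K, (V, Option[W])); pyspark returns None where Scala returns no match.
val joined = left.leftOuterJoin(right) // ("a", (1, Some(10))), ("b", (2, None))
val withDefaults = joined.mapValues { case (v, wOpt) => (v, wOpt.getOrElse(0)) }
~~~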
I couldn’t reproduce the exception; it’s probably solved in the latest code.
From: Vishal Vibhandik [mailto:vishal.vibhan...@gmail.com]
Sent: Thursday, August 14, 2014 11:17 AM
To: user@spark.apache.org
Subject: Spark SQL Stackoverflow error
Hi,
I tried running the sample SQL code JavaSparkSQL