I tried adjusting stepSize between 1e-4 and 1; it doesn't seem to be the
problem. The actual problem is that the model doesn't use an intercept.
So what happens is that it tries to compensate with super heavy weights
(~1e40) and ends up overflowing the model coefficients. The MSE explodes too.
Well, fair enough, but IMHO MLlib Logistic Regression is unusable right now.
The inability to use an intercept is just a no-go. I could hack a column of
ones into the data to inject the intercept, but frankly it's a pity to have
to do so.
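For what it's worth, the column-of-ones hack would look roughly like this (a
minimal sketch against the 1.0-era MLlib API; the helper name is mine):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD

  // Prepend a constant 1.0 feature so its learned weight acts as the intercept.
  def addInterceptColumn(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
    data.map(p => LabeledPoint(p.label, Vectors.dense(1.0 +: p.features.toArray)))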
2014-07-05 23:04 GMT+02:00 DB Tsai dbt...@dbtsai.com:
You may
Hi again, and thanks for your reply!
On Fri, Jul 4, 2014 at 8:45 PM, Michael Armbrust mich...@databricks.com wrote:
Sweet. Any idea about when this will be merged into master?
It is probably going to be a couple of weeks. There is a fair amount of
cleanup that needs to be done. It works
I need a variable to be broadcast from the driver to executor processes in my
Spark Java application. I tried using the Spark broadcast mechanism to achieve
this, but had no luck.
Could someone help me with this, and perhaps share some code?
Thanks,
Praveen R
Hi Praveen:
It may be easier for other people to help you if you provide more details about
what you are doing. It may be worthwhile to also mention which spark version
you are using. And if you can share the code which doesn't work for you, that
may also give others more clues as to what you
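In the meantime, here is the standard broadcast pattern as a minimal sketch
(Scala API; the Java API is the same idea via JavaSparkContext.broadcast and
Broadcast.value()):

  // Assumes sc: SparkContext is already created.
  val lookup = Map("a" -> 1, "b" -> 2)          // driver-side data
  val bcLookup = sc.broadcast(lookup)           // shipped to each executor once
  val rdd = sc.parallelize(Seq("a", "b", "c"))
  // Tasks running on executors read the broadcast value via .value
  val mapped = rdd.map(k => bcLookup.value.getOrElse(k, 0))
  println(mapped.collect().mkString(", "))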
Ok, I've tried to add the intercept term myself (code here [1]), but with
no luck.
It seems that adding a column of ones doesn't help with convergence either.
I may have missed something in the coding as I'm quite a noob in Scala, but
printing the data seems to indicate I succeeded in adding the
Guys, I'm not talking about running Spark on a VM; I don't have a problem with that.
I'm confused about the following:
1) Hortonworks describes the installation process as RPMs on each node
2) The Spark home page says that everything I need is YARN
And I'm stuck on understanding what I need to do to run Spark on YARN.
Thanks guys! Actually, I'm not doing any caching (at least I'm not calling
cache/persist), do I still need to use the DISK_ONLY storage level?
However, I do use reduceByKey and sortByKey. Mayur, you mentioned that
sortByKey requires data to fit the memory. Is there any way to work around
this
Thank you for all the replies!
Realizing that I can't distribute the modelling with different
cross-validation folds to the cluster nodes this way (but to the threads
only), I decided not to create nfolds data sets but to parallelize the
calculation (threadwise) over folds and to zip the
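A rough sketch of that thread-wise approach (nFolds, data and the helpers
below are hypothetical placeholders):

  // A Scala parallel collection runs the folds concurrently in driver-side
  // threads; each fold's Spark jobs are still distributed over the cluster.
  val scores = (0 until nFolds).par.map { k =>
    val (train, test) = makeFold(data, k)   // hypothetical fold-splitting helper
    val model = trainModel(train)           // hypothetical training helper
    evaluate(model, test)                   // hypothetical scoring helper
  }.toList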
Hi,
I am planning to do graph (social network) computation on a cluster (Hadoop has
been installed), but it seems there is a pre-built package for Hadoop, and
I am NOT sure whether GraphX is included in it.
Or should I install another released version (one that obviously includes
GraphX)?
Konstantin,
1. You need to install the Hadoop RPMs on all nodes. If it is Hadoop 2,
the nodes would have HDFS and YARN.
2. Then you need to install Spark on all nodes. I haven't had experience
with HDP, but the tech preview might have installed Spark as well.
3. In the end, one should
Thank you Krishna!
Could you please explain why I need to install Spark on each node, if the Spark
official site says: If you have a Hadoop 2 cluster, you can run Spark
without any installation needed?
I have HDP 2 (YARN) and that's why I hope I don't need to install Spark on
each node.
Thank you,
Using persist() is a sort of a hack or a hint (depending on your
perspective :-)) to make the RDD use disk, not memory. As I mentioned
though, the disk io has consequences, mainly (I think) making sure you have
enough disks to not let io be a bottleneck.
Increasing partitions I think is the other
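In code the hint is just this (a minimal sketch; the input path is
hypothetical):

  import org.apache.spark.storage.StorageLevel

  val bigRdd = sc.textFile("hdfs:///some/large/input")  // hypothetical path
  bigRdd.persist(StorageLevel.DISK_ONLY)                // spill to disk, not memory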
When I use spark-submit (along with spark-ec2), I get dependency
conflicts. spark-assembly includes older versions of apache commons
codec and httpclient, and these conflict with many of the libs our
software uses.
Is there any way to resolve these? Or, if we use the precompiled
spark, can we
In YARN cluster mode, you can either have Spark on all the cluster nodes or
supply the Spark jar yourself. In the second case, you don't need to install
Spark on the cluster at all, as you supply the Spark assembly as well as your
app jar together.
I hope this makes it clear.
Chester
Sent from my iPhone
you could only do the deep check if the hashcodes are the same and design
hashcodes that do not take all elements into account.
the alternative seems to be putting cache statements all over graphx, as is
currently the case, which is trouble for any long lived application where
caching is
I have a basic spark streaming job that is watching a folder, processing
any new file and updating a column family in cassandra using the new
cassandra-spark-driver.
I think there is a problem with SparkStreamingContext.textFileStream... if
I start my job in local mode with no files in the folder
Hi Chester,
Thank you very much, it is clear now - just two different ways to support
Spark on a cluster.
Thank you,
Konstantin Kudryavtsev
On Mon, Jul 7, 2014 at 3:22 PM, Chester @work ches...@alpinenow.com wrote:
In Yarn cluster mode, you can either have spark on all the cluster nodes
or
Hi,
I was wondering what was the state of the Pig+Spark initiative now that the
execution engine of Pig is pluggable? Granted, it was done in order to use
Tez but could it be used by Spark? I know about a 'theoretical' project
called Spork but I don't know any stable and maintained version of it.
Hi,
We have fixed many major issues around Spork, deploying it with some
customers. Would be happy to provide a working version for you to try out.
We are looking for more folks to try it out and submit bugs.
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Hello Experts,
I am doing some comparative study on the below:
Spark vs Impala
Spark vs MapReduce. Is it worth migrating from an existing MR implementation to
Spark?
Please share your thoughts and expertise.
Thanks,
Santosh
That version is old :).
We are not forking Pig but cleanly separating out the Pig execution engine.
Let me know if you are willing to give it a go.
Also, I would love to know which features of Pig you are using?
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Hi all,
Is there any way to control the number of tasks per stage?
Currently I see a situation where only 2 tasks are created per stage and each
of them is very slow, while at the same time the cluster has a huge number of
unused nodes.
Thank you,
Konstantin Kudryavtsev
Hi, we're planning to add a basic Java API very soon, possibly this week.
There's a ticket for it here:
https://github.com/datastax/cassandra-driver-spark/issues/11
We're open to any ideas. Just let us know what you need the API to have in
the comments.
Regards,
Piotr Kołaczkowski
2014-07-05
The default number of tasks when reading files is based on how the files
are split among the nodes. Beyond that, the default number of tasks after a
shuffle is based on the property spark.default.parallelism. (see
http://spark.apache.org/docs/latest/configuration.html).
You can use
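For example (a minimal sketch; the value 64 is illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf().set("spark.default.parallelism", "64")
  // or override the task count for a single shuffle operation
  // (assumes pairs: RDD[(K, V)]):
  val counts = pairs.reduceByKey(_ + _, numPartitions = 64)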
It seems to me to be a setup issue. I just tested news20.binary (1355191
features) on a 2-node EC2 cluster and it worked well. I added one line
to conf/spark-env.sh:
export SPARK_JAVA_OPTS="-Dspark.akka.frameSize=20"
and launched spark-shell with --driver-memory 20g. Could you re-try
with an EC2
with an EC2
No, but it should be easy to add one. -Xiangrui
On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander
alexander.ula...@hp.com wrote:
Hi,
Is there a method in Spark/MLlib to convert DenseVector to SparseVector?
Best regards, Alexander
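There was no built-in at the time, but a manual conversion is short (a sketch;
zero entries are simply dropped):

  import org.apache.spark.mllib.linalg.{DenseVector, SparseVector}

  def toSparse(dv: DenseVector): SparseVector = {
    val nz = dv.values.zipWithIndex.filter(_._1 != 0.0)  // keep non-zero entries
    new SparseVector(dv.size, nz.map(_._2), nz.map(_._1))
  }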
i noticed that some algorithms such as graphx liberally cache RDDs for
efficiency, which makes sense. however it can also leave a long trail of
unused yet cached RDDs, that might push other RDDs out of memory.
in a long-lived spark context i would like to decide which RDDs stick
around. would it
I think tiers/priorities for caching are a very good idea and I'd be
interested to see what others think. In addition to letting libraries cache
RDDs liberally, it could also unify memory management across other parts of
Spark. For example, small shuffles benefit from explicitly keeping the
spark-submit includes a spark-assembly uber jar, which has older
versions of many common libraries. These conflict with some of the
dependencies we need. I have been racking my brain trying to find a
solution (including experimenting with ProGuard), but haven't been
able to: when we use
spark has a setting to put user jars in front of classpath, which should do
the trick.
however i had no luck with this. see here:
https://issues.apache.org/jira/browse/SPARK-1863
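for reference, i believe the setting in question is the experimental
spark.files.userClassPathFirst flag:

  spark.files.userClassPathFirst true

but as the jira shows, it was not reliable for me.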
On Mon, Jul 7, 2014 at 1:31 PM, Robert James srobertja...@gmail.com wrote:
spark-submit includes a spark-assembly
Others have also asked for this on the mailing list, and hence there's a
related JIRA: https://issues.apache.org/jira/browse/SPARK-1762. Ankur
brings up a good point in that any current implementation of in-memory
shuffles will compete with application RDD blocks. I think we should
definitely add
Hi All, I am having the following issue -- maybe an fqdn/ip resolution issue,
but not sure; any help with this will be great!
On the master node I get the following error. I start the master using
./start-master.sh:
starting org.apache.spark.deploy.master.Master, logging to
Hi guys,
Not sure if you have similar issues. Did not find relevant tickets in
JIRA. When I deploy the Spark Streaming to YARN, I have following two
issues:
1. The UI port is random. It is not the default 4040. I have to look at the
container's log to check the UI port. Is this supposed to be this
I opened a JIRA issue with Spark, as an improvement though, not as a bug.
Hopefully, someone there will notice it.
From: Tobias Pfeiffer t...@preferred.jp
Reply-To: user@spark.apache.org
user@spark.apache.org
Date:
Hi list,
I'm writing a Spark Streaming program that reads from a kafka topic,
performs some transformations on the data, and then inserts each record in
a database with foreachRDD. I was wondering which is the best way to handle
the connection to the database so each worker, or even each task,
I will assume that you are running in yarn-cluster mode. Because the driver
is launched in one of the containers, it doesn't make sense to expose port
4040 for the node that contains the container. (Imagine if multiple driver
containers are launched on the same node. This will cause a port
Hi all,
I'm writing a Spark Streaming program that uses reduceByKeyAndWindow(), and
when I change the windowLength or slidingInterval I get the following
exceptions, running in local mode:
14/07/06 13:03:46 ERROR actor.OneForOneStrategy: key not found:
1404677026000 ms
Hi Andrew,
Thanks for the quick reply. It works with the yarn-client mode.
One question about the yarn-cluster mode: actually I was checking the AM
for the log, since the spark driver is running in the AM, the UI should
also work, right? But that is not true in my case.
Best,
Fang, Yan
I don't have experience deploying to EC2. Can you use the add.jar conf to add
the missing jar at runtime? I haven't tried this myself; just a guess.
On Mon, Jul 7, 2014 at 12:16 PM, Chester Chen ches...@alpinenow.com wrote:
with provided scope, you need to provide the provided jars at the
Thanks - that did solve my error, but instead I got a different one:
java.lang.NoClassDefFoundError:
org/apache/hadoop/mapreduce/lib/input/FileInputFormat
It seems like with that setting, spark can't find Hadoop.
On 7/7/14, Koert Kuipers ko...@tresata.com wrote:
spark has a setting to put user
Has anyone reported issues using SparkSQL with sequence files (all of our
data is in this format within HDFS)? We are considering whether to burn
the time upgrading to Spark 1.0 from 0.9 now and this is a main decision
point for us.
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained; Spark's API feels much more natural to
me. Testing and local development is also very easy - creating a local
Spark context is trivial and it reads local files. For your unit tests you
can
Thanks Daniel for sharing this info.
Regards,
Santosh Karthikeyan
From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Tuesday, July 08, 2014 1:10 AM
To: user@spark.apache.org
Subject: Re: Comparative study
From a development perspective, I vastly prefer Spark to MapReduce. The
@Yan, the UI should still work. As long as you look into the container that
launches the driver, you will find the SparkUI address and port. Note that
in yarn-cluster mode the Spark driver doesn't actually run in the
Application Manager; just like the executors, it runs in a container that
is
Thank you, Andrew. That makes sense to me now. I was confused by "In
yarn-cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster" in
http://spark.apache.org/docs/latest/running-on-yarn.html . After
your explanation, it's clear now. Thank you.
@Andrew
Yes, the link points to the same redirected
http://localhost/proxy/application_1404443455764_0010/
I suspect something to do with the cluster setup. I will let you know
once I find something.
Chester
On Mon, Jul 7, 2014 at 1:07 PM, Andrew Or and...@databricks.com wrote:
xtrahotsauce wrote
I had this same problem as well. I ended up just adding the necessary
code
in KafkaUtil and compiling my own spark jar. Something like this for the
raw stream:
def createRawStream(
    jssc: JavaStreamingContext,
    kafkaParams: JMap[String, String],
i was testing using the acl for spark ui in secure mode on yarn in client
mode.
it works great. my spark 1.0.0 configuration has:
spark.authenticate = true
spark.ui.acls.enable = true
spark.ui.view.acls = koert
spark.ui.filters =
Hi Kudryavtsev,
Here's what I am doing as a common practice and reference. I don't want to say
it is best practice, since that requires a lot of customer experience and
feedback, but from a development and operating standpoint, it will be great to
separate the YARN container logs from the Spark
Hi guys,
I'm running Spark 1.0.0 with Tachyon 0.4.1, both in single node mode.
Tachyon's own tests (./bin/tachyon runTests) pass, and manual file
system operations like mkdir work well. But when I tried to run a very
simple Spark task with RDD persist set to OFF_HEAP, I got the following
I found it quite painful to figure out all the steps required and have
filed SPARK-2394 https://issues.apache.org/jira/browse/SPARK-2394 to
track improving this. Perhaps I have been going about it the wrong way, but
it seems way more painful than it should be to set up a Spark cluster built
using
Hi,
I hope someone can help as I’m not sure if I’m using Spark correctly.
Basically, in the simple example below
I create an RDD which is just a sequence of random numbers. I then have a loop
where I just invoke rdd.count()
what I can see is that the memory use always nudges upwards.
If I
Hey Cheney,
Is the problem still occurring?
Sorry for the delay; I'm starting to look at this issue.
Best,
--
Nan Zhu
On Tuesday, May 6, 2014 at 10:06 PM, Cheney Sun wrote:
Hi Nan,
In the worker's log, I see the following exception thrown when trying to
launch an executor. (The
For Scala API on map/reduce (hadoop engine) there's a library called
Scalding. It's built on top of Cascading. If you have a huge dataset or
if you consider using map/reduce engine for your job, for any reason, you
can try Scalding.
However, Spark vs Impala doesn't make sense to me. It should've
On Tue, Jul 8, 2014 at 1:05 AM, Nabeel Memon nm3...@gmail.com wrote:
For Scala API on map/reduce (hadoop engine) there's a library called
Scalding. It's built on top of Cascading. If you have a huge dataset or
if you consider using map/reduce engine for your job, for any reason, you
can try
I haven't heard any reports of this yet, but I don't see any reason why it
wouldn't work. You'll need to manually convert the objects that come out of
the sequence file into something where SparkSQL can detect the schema (i.e.
scala case classes or java beans) before you can register the RDD as a
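A minimal sketch of that conversion (Spark 1.0 API; the path and the record
shape are assumptions):

  // In spark-shell; in a standalone app also import org.apache.spark.SparkContext._
  case class Record(key: String, value: String)

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._  // implicit RDD-to-SchemaRDD conversion

  val records = sc.sequenceFile[String, String]("hdfs:///path/to/data")
    .map { case (k, v) => Record(k, v) }
  records.registerAsTable("records")  // the Spark 1.0 name for this call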
Daniel,
Do you mind sharing the size of your cluster and the production data volumes ?
Thanks
Soumya
On Jul 7, 2014, at 3:39 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
From a development perspective, I vastly prefer Spark to MapReduce. The
MapReduce API is very constrained;
Hi Subacini,
Just want to follow up on this issue. SPARK-2339 has been merged into the
master and 1.0 branch.
Thanks,
Yin
On Tue, Jul 1, 2014 at 2:00 PM, Yin Huai huaiyin@gmail.com wrote:
Seems it is a bug. I have opened
https://issues.apache.org/jira/browse/SPARK-2339 to track it.
SchemaRDDs, provided by Spark SQL, have a saveAsParquetFile command. You
can turn a normal RDD into a SchemaRDD using the techniques described here:
http://spark.apache.org/docs/latest/sql-programming-guide.html
This should work with Impala, but if you run into any issues please let me
know.
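For example (a sketch following the programming guide above; names and paths
are illustrative):

  case class Person(name: String, age: Int)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext._

  val people = sc.textFile("people.txt")
    .map(_.split(","))
    .map(p => Person(p(0), p(1).trim.toInt))
  people.saveAsParquetFile("people.parquet")  // writes a Parquet directory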
Hi,
I'm trying to understand the relationship of the number of cores and the
number of executors when running a Spark job on YARN.
The test environment is as follows:
- # of data nodes: 3
- Data node machine spec:
- CPU: Core i7-4790 (# of cores: 4, # of threads: 8)
- RAM: 32GB (8GB
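The spark-submit flags involved are these (values illustrative):

  spark-submit --num-executors 6 --executor-cores 4 --executor-memory 8g ...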
Hi,
I am running some simple samples for my project. Right now the Spark sample is
running on Hadoop 2.2 with YARN. My question is: what is the main difference
between running as spark-client and spark-cluster, apart from the different way
to submit our job? And what is the specific way to configure the job
Hi Gray,
Like Michael mentioned, you need to take care of the Scala case classes or Java
beans, because SparkSQL needs the schema.
Currently we are trying to insert our data into HBase with Scala 2.10.4 and
Spark 1.0.
All the data are tables. We created one case class for each row, which means
We know Scala 2.11 has removed the limitation on the number of parameters, but
Spark 1.0 is not compatible with it. So now we are considering using Java
beans instead of Scala case classes.
You can also manually create a class that implements scala's Product
interface. Finally, SPARK-2179
Actually, the mode that needs the jar installed on each individual node is
standalone mode, which works for both MR1 and MR2. Cloudera and
Hortonworks currently support Spark in this way as far as I know.
For both yarn-cluster and yarn-client, Spark will distribute the jars
through the distributed cache
I typed spark parquet into google and the top result was this blog post
about reading and writing parquet files from spark
http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/
On Mon, Jul 7, 2014 at 5:23 PM, Michael Armbrust mich...@databricks.com
wrote:
SchemaRDDs, provided by Spark
Hi All,
This is a bit late, but I found it helpful. Piggy-backing on Wang Hao's
comment, spark will ignore the spark.executor.memory setting if you add
it to SparkConf via:
conf.set("spark.executor.memory", "1g")
What you actually should do depends on how you run spark. I found some
official
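(For example, a hedged sketch: pass it at submit time instead,

  spark-submit --executor-memory 1g ...

or set spark.executor.memory in conf/spark-defaults.conf.)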
Hi,
Sailthru is also using Spark. Could you please add us to the Powered By
Spark https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark page
when you have a chance?
Organization Name: Sailthru
URL: www.sailthru.com
Short Description: Our data science platform uses Spark to build
Juan,
I am doing something similar, just not insert into SQL database, but
issue some RPC call. I think mapPartitions() may be helpful to you. You
could do something like
dstream.mapPartitions(iter => {
  val db = new DbConnection()
  // maybe only do the above if !iter.isEmpty
  iter.map(item =>
spark-client mode runs the driver in your application's JVM, while
spark-cluster mode runs the driver in the YARN cluster.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Jul 7, 2014 at 5:44 PM,
The names of the directories that are created for the metastore are
different (metastore vs metastore_db), but that should be it. Really
we should get rid of LocalHiveContext as it is mostly redundant and the
current state is kind of confusing. I've created a JIRA to figure this out
before the
Hi, mailing list:
I downloaded the pre-built Spark packages for CDH4, but it says it cannot
support YARN. Why? Do I need to build it myself with YARN support enabled?
Hi Michael,
Thanks for the reply.
Actually last week I tried to play with the Product interface, but I'm not
really sure whether I did it correctly. Here is what I did:
1. Created an abstract class A with the Product interface, which has 20
parameters,
2. Created case class B extending A; B has 20
I know the order of processing DStreams is guaranteed. I am wondering if the
order of messages within one DStream is guaranteed. My gut feeling is yes,
because an RDD is immutable. Some simple tests support this, but I want
to hear from an authority to persuade myself. Thank you.
Best,
Fang, Yan
For your information, I've attached the Ganglia monitoring screen capture on
the Stack Overflow question.
Please see:
http://stackoverflow.com/questions/24622108/apache-spark-the-number-of-cores-vs-the-number-of-executors
From: innowireless TaeYun Kim [mailto:taeyun@innowireless.co.kr]
Hi guys, previously I checked out the old Spork and updated it to Hadoop 2.0,
Scala 2.10.3 and Spark 0.9.1; see my github project:
https://github.com/pelick/flare-spork
It is also highly experimental, and just directly maps Pig physical
operations to Spark RDD transformations/actions.
Here is a simple example of registering an RDD of Products as a table. It
is important that all of the fields are val defined in the constructor and
that you implement canEqual, productArity and productElement.
class Record(val x1: String) extends Product with Serializable {
  def canEqual(that: Any): Boolean = that.isInstanceOf[Record]
  def productArity: Int = 1
  def productElement(n: Int): Any = x1
}
Hi Madhu,
I don't think you can reuse the persisted RDD the next time you run the
program, because the folder for RDD materialization will have changed, and
Spark will lose the information about how to retrieve the previously persisted
RDD. AFAIK Spark has a fault tolerance mechanism; node failure
The only partitioning that is currently supported is through Hive
partitioned tables. Supporting this for parquet as well is on our radar,
but probably won't happen for 1.1.
On Sun, Jul 6, 2014 at 10:00 PM, Raffael Marty ra...@pixlcloud.com wrote:
Does SparkSQL support partitioned parquet
Hi,
Some help with implementation best practices is needed.
The operating environment is as follows:
- Log data file arrives irregularly.
- The size of a log data file is from 3.9KB to 8.5MB. The average is about
1MB.
- The number of records of a data file is from 13 lines to 22000
Hi All,
Does anyone know what the command line arguments to mvn are to generate the
pre-built binary for Spark on Hadoop 2-CDH5?
I would like to pull in a recent bug fix from spark-master and rebuild the
binaries in exactly the same way as was used for those provided on the
website.
I have tried
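(For reference, the build docs of the time suggest something along these lines
for CDH5; the exact version string is an assumption and may need adjusting:

  mvn -Pyarn -Dhadoop.version=2.3.0-cdh5.0.0 -DskipTests clean package
)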
If you receive data through multiple receivers across the cluster, I don't
think any order can be guaranteed. Order in distributed systems is tough.
On Tuesday, July 8, 2014, Yan Fang yanfang...@gmail.com wrote:
I know the order of processing DStream is guaranteed. Wondering if the
order of
They are for CDH4 without YARN, since YARN is experimental in that. You can
download one of the Hadoop 2 packages if you want to run on YARN. Or you might
have to build specifically against CDH4's version of YARN if that doesn't work.
Matei
On Jul 7, 2014, at 9:37 PM, ch huang