This is not supported in MLlib. Hopefully, we will add support for
weighted examples in v1.2. If you want to train weighted instances
with the current tree implementation, please try importance sampling
first to adjust the weights. For instance, an example with weight 0.3
is sampled with probability proportional to its weight.
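A minimal sketch of that workaround (assuming weights in [0, 1]; the RDD type and names are illustrative, not from the original thread):

    import scala.util.Random
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint

    // Keep each example with probability equal to its weight, so the
    // resampled (unweighted) data set approximates the weighted one.
    def importanceSample(weighted: RDD[(Double, LabeledPoint)]): RDD[LabeledPoint] =
      weighted.filter { case (w, _) => Random.nextDouble() < w }
              .map { case (_, point) => point }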
We have a pending PR (https://github.com/apache/spark/pull/216) for
discretization but it has performance issues. We will try to spend
more time to improve it. -Xiangrui
On Tue, Sep 2, 2014 at 2:56 AM, filipus floe...@gmail.com wrote:
I guess I found it.
How to install? Just clone the code with git clone
https://github.com/apache/spark/pull/216 and then sbt package?
Is it the same as https://github.com/LIDIAgroup/SparkFeatureSelection,
or something different?
filip
I think they are the same. If you have hub (https://hub.github.com/)
installed, you can run
hub checkout https://github.com/apache/spark/pull/216
and then `sbt/sbt assembly`
-Xiangrui
On Wed, Sep 3, 2014 at 12:03 AM, filipus floe...@gmail.com wrote:
How to install? Just clone the code with git clone
You really should show your Spark code then. I think you are misusing
one of the Spark APIs, and are processing a collection containing one
ArrayBuffer at some point, not an ArrayBuffer itself.
On Wed, Sep 3, 2014 at 6:42 AM, Deep Pradhan pradhandeep1...@gmail.com wrote:
I have a problem here.
When I run the
I finished a distributed project in Hadoop Streaming and it worked fine
using memcached storage during mapping. Actually, it's a Python project.
Now I want to move it to Spark. But when I called the memcached library, two
errors were found during the computation (both are shown below).
1. File memcache.py, line 414,
To make my shell experience merrier, I need to import several packages, and
define implicit sparkContext and sqlContext.
Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when
it's started?
Cheers,
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
Hi,
Can someone tell me what kind of operations can be performed on a
replicated RDD? What are the use cases of a replicated RDD?
One basic doubt that has been bothering me for a long time: what is the
difference between an application and a job in Spark parlance? I am confused
because of Hadoop.
Hi,
I am trying to run query 3 from the TPC-H benchmark using SparkSQL. But, I
am running into errors which I believe are because the parser does not
accept the JOIN syntax I am trying.
Below are the queries I tried and the error messages I am seeing.
Exception in thread main
Hey,
I am about to implement a Spark app which will require using both PySpark and
Spark on Scala.
Data should be read from AWS S3 (compressed CSV files), and must be
pre-processed by an existing Python codebase. However, our final goal is to
make those datasets available for Spark apps
Hi,
I'm getting the same error while manually setting up Spark cluster.
Has there been any update about this error?
Rgds
Niranda
Does the Spark ML team have plans to support R scripts natively? There is a SparkR
project, but it is not from the Spark team. Spark ML uses netlib-java to talk with
native Fortran routines, or uses NumPy; why not try to use R in some sense?
R has a lot of useful packages. If the Spark ML team can include R support,
Hi,
How can I list all registered tables in a sql context?
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
Hey,
You can use spark-shell -i sparkrc to do this.
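For example, a minimal sparkrc might look like this (a sketch for Spark 1.0/1.1; spark-shell already defines sc):

    // sparkrc -- load with: spark-shell -i sparkrc
    import org.apache.spark.SparkContext._
    import org.apache.spark.sql.SQLContext

    // Wrap the shell's SparkContext in an implicit SQLContext.
    implicit val sqlContext = new SQLContext(sc)
    import sqlContext._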
Prashant Sharma
On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
To make my shell experience merrier, I need to import several packages,
and define implicit sparkContext and sqlContext.
Is there a startup
Hello everyone. I'm trying to receive a DStream structured as JSON from a
Kafka topic, and I want to parse the content of each JSON message. The JSON
I receive is something like this:
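Whatever the exact schema, a minimal parsing sketch with json4s (which ships with Spark) might look like this; messages and the "field" key are placeholders:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    // messages: DStream[String] of raw JSON payloads, e.g. from
    // KafkaUtils.createStream(...).map(_._2)
    val parsed = messages.map { record =>
      implicit val formats = DefaultFormats
      (parse(record) \ "field").extract[String]  // "field" is a placeholder key
    }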
Hi,
I tried Broadcast.unpersist() on Spark 1.0.1 but the MemoryStore (driver memory)
still held it.
//LOGS
//Block broadcast_0 stored as values to memory (estimated size 380.1 MB,
free 5.7 GB)
The free memory size was the same after calling unpersist.
Can I clear this?
Hi all.
I have been trying to run PySpark on YARN for a couple of days already:
http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/
I posted the exception in previous posts. It looks like I didn't do the
configuration correctly.
I googled quite a lot and I can't find the steps that should be done to
Hi,
I would like to implement an asynchronous distributed optimization algorithm
where workers communicate among one another. It is similar to belief
propagation, where each worker is a vertex in the graph. Can someone let me
know if this is possible using Spark?
Thanks,
Laxman
Asynchrony is not supported directly - Spark's programming model is
naturally BSP. I have seen cases where people have instantiated actors with
Akka on worker nodes to enable message passing, or even used Spark's own
ActorSystem to do this. But I do not recommend this, since you lose a
bunch of
Hi Ankur,
Thanks so much for your advice.
But it failed when I tried to set the storage level when constructing a graph.
val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)
Are you using SQLContext or HiveContext? The default sql dialect in
HiveContext (HiveQL) is a little more complete and might be a better place
to start.
On Wed, Sep 3, 2014 at 2:12 AM, Samay smilingsa...@gmail.com wrote:
Hi,
I am trying to run query 3 from the TPC-H benchmark using
Dear Xiangrui,
Thanks for your reply. We will use sampling for now. However, just to let you
know, we believe that it is not the best fit for our problems for two reasons:
(1) the high dimensionality of the data (600 features) and (2) a highly skewed
distribution.
Do you have any idea when MLlib v1.2
Hi all,
Assume I have read the lines of a text file into an RDD:
textFile = sc.textFile("SomeArticle.txt")
Also assume that the sentence breaks in SomeArticle.txt were done by machine
and have some errors, such as the break at Fig. in the sample text below.
Index Text
N...as shown
Hi,
Sorry for the little delay. As discussed in this thread, I have modified the
Kafka-Spark-Consumer (https://github.com/dibbhatt/kafka-spark-consumer)
code to have a dedicated Receiver for every Topic Partition. You can see an
example of how to create a union of these receivers
in
Interestingly, there was an almost identical question posed on Aug 22 by
cjwang. Here's the link to the archive:
http://apache-spark-user-list.1001560.n3.nabble.com/Finding-previous-and-next-element-in-a-sorted-RDD-td12621.html#a12664
On Wed, Sep 3, 2014 at 10:33 AM, Daniel, Ronald (ELS-SDG)
There is support for Spark in ElasticSearch’s Hadoop integration package.
http://www.elasticsearch.org/guide/en/elasticsearch/hadoop/current/spark.html
Maybe you could split and insert all of your documents from Spark and then
query for “MoreLikeThis” on the ElasticSearch index. I haven’t
Hi,
I have set properties in conf/spark-defaults.conf and started with the command
./sbin/start-history-server.sh /tmp/spark-events. It gets errors, and it seems
that the properties in the spark-defaults.conf file don't take effect. How can I
solve this problem (enable properties in spark-defaults.conf
The history server (and other Spark daemons) do not read
spark-defaults.conf. There's a bug open to implement that
(SPARK-2098), and an open PR to fix it, but it's still not in Spark.
On Wed, Sep 3, 2014 at 11:00 AM, Zhanfeng Huo huozhanf...@gmail.com wrote:
Hi,
I have set properties in
I spoke with SK offline about this, it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
Thanks for the pointer to that thread. Looks like there is some demand for this
capability, but not a lot yet. Also doesn't look like there is an easy answer
right now.
Thanks,
Ron
From: Victor Tso-Guillen [mailto:v...@paxata.com]
Sent: Wednesday, September 03, 2014 10:40 AM
To: Daniel,
My Spark history server won't start because it's trying to hit the namenode on
8021, but the namenode is on 8020 (the default). How can I configure the
history server to use the right port? I can't find any relevant setting in the
docs:
Hello,
What is included in the Spark web UI? What are the available URLs? Can the
information be obtained in a machine-readable way (e.g. JSON, XML, etc)?
Thanks!
Best,
Oliver
Oliver Ruebenacker | Solutions Architect
Altisource(tm)
290 Congress St, 7th Floor | Boston,
Nevermind, PEBKAC. I had put in the wrong port in the $LOG_DIR environment
variable.
Greg
From: Greg greg.h...@rackspace.commailto:greg.h...@rackspace.com
Date: Wednesday, September 3, 2014 1:56 PM
To: user@spark.apache.orgmailto:user@spark.apache.org
Hi Oliver,
Spark standalone master and worker support '/json' endpoint in web UI,
which returns some of the information in JSON format.
I wasn't able to find relevant documentation, though.
- Wonha
On Wed, Sep 3, 2014 at 12:12 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
Thanks for the help! But I tried starting with "--master local[4]" and when I
load http://localhost:4040/json I just get forwarded to
http://localhost:4040/stages/, and it's all human-readable HTML, no JSON.
Best,
Oliver
From: Wonha Ryu [mailto:wonha@gmail.com]
There is a sliding method implemented in MLlib
(https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/rdd/SlidingRDD.scala),
which is used in computing Area Under Curve:
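A minimal usage sketch (sliding comes in through the RDDFunctions implicit; it assumes the RDD is already sorted):

    import org.apache.spark.mllib.rdd.RDDFunctions._

    // Each element of `windows` is an Array of 2 consecutive elements,
    // which gives access to the previous/next element of a sorted RDD.
    val windows = sc.parallelize(1 to 10).sliding(2)
    windows.collect().foreach(w => println(w.mkString(", ")))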
Hey Oliver,
IIRC there's no JSON endpoint for application web UI. They only exist for
cluster master and worker.
- Wonha
On Wed, Sep 3, 2014 at 12:58 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
Thanks for the help! But I tried starting with
At 2014-09-03 17:58:09 +0200, Yifan LI iamyifa...@gmail.com wrote:
val graph = GraphLoader.edgeListFile(sc, edgesFile, minEdgePartitions =
numPartitions).partitionBy(PartitionStrategy.EdgePartition2D).persist(StorageLevel.MEMORY_AND_DISK)
Error: java.lang.UnsupportedOperationException: Cannot
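A sketch of the likely fix, assuming Spark 1.1+, where GraphLoader.edgeListFile accepts target storage levels directly instead of a later persist() call:

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
    import org.apache.spark.storage.StorageLevel

    // Passing the levels at load time avoids changing the storage level
    // of RDDs that GraphLoader has already persisted internally.
    val graph = GraphLoader.edgeListFile(sc, edgesFile,
        minEdgePartitions = numPartitions,
        edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
        vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
      .partitionBy(PartitionStrategy.EdgePartition2D)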
Hello,
Interestingly, http://localhost:4040/metrics/json/ gives some numbers, but
only a few which never seem to change during the application’s lifetime.
Either the web UI has some very strange limitations, or there are some URLs
yet to be discovered that do something interesting.
I have been trying to understand how Spark Streaming and HBase connect, but
have not been successful. What I am trying to do is: given a Spark stream,
process that stream and store the results in an HBase table. So far this is
what I have:
import org.apache.spark.SparkConf
import
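For reference, a common shape for this kind of job (a hedged sketch, not the poster's code; it assumes the HBase 0.98 client API and a DStream of (String, String) pairs, and the table/column names are placeholders):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // Open the HBase connection inside foreachPartition so that nothing
    // non-serializable (like the StreamingContext) is captured in the closure.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val table = new HTable(HBaseConfiguration.create(), "results")
        records.foreach { case (key, value) =>
          val put = new Put(Bytes.toBytes(key))
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
          table.put(put)
        }
        table.close()
      }
    }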
Hi guys,
Curious how you deal with the logs. I find it difficult to debug with the
logs: we run Spark Streaming in our YARN cluster using client mode, so we have
two logs: the YARN log and the local log (for the client). Whenever I have a
problem, the log is too big to read with gedit and grep. (e.g. after
Hi Greg,
For future reference, you can set spark.history.ui.port in
SPARK_HISTORY_OPTS. By default this should point to 18080. This information
is actually in the link that you provided :) (as well as the most updated
docs here: http://spark.apache.org/docs/latest/monitoring.html)
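For example (a sketch; set this in conf/spark-env.sh before starting the history server):

    export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080"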
-Andrew
Hi Oleg,
There isn't much you need to do to setup a Yarn cluster to run PySpark. You
need to make sure all machines have python installed, and... that's about
it. Your assembly jar will be shipped to all containers along with all the
pyspark and py4j files needed. One caveat, however, is that the
Hi Kevin, there is currently no way to do this... Broadcast.unpersist()
only unpersists it from the executors, but not from the driver. However,
this is not that bad, because Spark automatically cleans up broadcasts
that are no longer used, even on the driver. So as long as there is no
memory
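For context, a small sketch of the behavior being described (Spark 1.0.x; the broadcast value is illustrative):

    val bc = sc.broadcast(Array.fill(1000000)(0.0))
    // Frees the copies cached on the executors; the driver-side copy
    // remains until Spark's automatic cleanup garbage-collects it.
    bc.unpersist()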
Adding back user@
I am not familiar with the NotSerializableException. Can you show the full
stack trace?
See SPARK-1297 for changes you need to make so that Spark works with hbase
0.98
Cheers
On Wed, Sep 3, 2014 at 2:33 PM, Kevin Peng kpe...@gmail.com wrote:
Ted,
The hbase-site.xml is
This doesn't seem to have to do with HBase per se. Some function is
getting the StreamingContext into the closure, and that won't work. Is
this exactly the code? Since it doesn't reference a StreamingContext,
is there maybe a different version in reality that tries to use
StreamingContext
Thanks Xiangrui, that looks very helpful.
Best regards,
Ron
-Original Message-
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Wednesday, September 03, 2014 1:19 PM
To: Daniel, Ronald (ELS-SDG)
Cc: Victor Tso-Guillen; user@spark.apache.org
Subject: Re: Accessing neighboring
Sean,
I create a streaming context near the bottom of the code (ssc) and
basically apply a foreachRDD on the resulting DStream so that I can get
access to the underlying RDD, which in turn I apply a foreach on, passing
in my function which applies the storing logic.
Is there a different
Hello,
If launched with local as master, where are master and workers? Do they
each have a web UI? How can they be monitored?
Thanks!
Best,
Oliver
Oliver Ruebenacker | Solutions Architect
Altisource(tm)
290 Congress St, 7th Floor | Boston, Massachusetts 02210
P: (617)
local means everything runs in the same process; that means there is
no need for master and worker daemons to start processes.
On Wed, Sep 3, 2014 at 3:12 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
Hello,
If launched with “local” as master, where are master
How can that single process be monitored? Thanks!
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Wednesday, September 03, 2014 6:32 PM
To: Ruebenacker, Oliver A
Cc: user@spark.apache.org
Subject: Re: If master is local, where are master and workers?
local
The only monitoring available is the driver's Web UI, which will
generally be available on port 4040.
On Wed, Sep 3, 2014 at 3:43 PM, Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com wrote:
How can that single process be monitored? Thanks!
-Original Message-
From: Marcelo
Ted,
Here is the full stack trace coming from spark-shell:
14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming
job 1409786463000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task not
serializable: java.io.NotSerializableException:
In word count, you don't need much driver memory unless you do collect, which
is not recommended.
val file = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/data")
val counts = file.flatMap(line => line.split(" ")).map(word => (word,
1)).reduceByKey(_ + _)
Hi Zhanfeng,
You will need to set these through SPARK_HISTORY_OPTS in conf/spark-env.sh.
This is documented here: http://spark.apache.org/docs/latest/monitoring.html
Let me know if you have it working,
-Andrew
2014-09-03 11:14 GMT-07:00 Marcelo Vanzin van...@cloudera.com:
The history
It seems like the next release will add a nice org.apache.spark.mllib.feature
package but what is the recommended way to normalize features in the
current release (1.0.2) -- I'm hoping for a general pointer here.
At the moment I have an RDD[LabeledPoint] and I can get
a
Hello,
On Wed, Sep 3, 2014 at 6:02 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:
Can someone tell me what kind of operations can be performed on a
replicated rdd?? What are the use-cases of a replicated rdd.
I suggest you read
I found the answer. The file system for the checkpoint should be a
fault-tolerant file system like HDFS, so we should set it to an HDFS path,
not a local file system path.
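For example (a sketch; the HDFS URI is a placeholder for your cluster's namenode):

    // The checkpoint directory must live on a fault-tolerant file system.
    ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")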
2014-09-03 10:28 GMT+08:00 Tao Xiao xiaotao.cs@gmail.com:
I tried to run KafkaWordCount in a Spark
Hi,
I am not sure if multi-tenancy is the right word, but I am thinking about
a Spark application where multiple users can, say, log into some web
interface and specify a data processing pipeline with streaming source,
processing steps, and output.
Now as far as I know, there can be only one
In the current state of Spark Streaming, creating separate Java processes
each having a streaming context is probably the best approach to
dynamically adding and removing input sources. All of these should be
able to use a YARN cluster for resource allocation.
On Wed, Sep 3, 2014 at 6:30
Maybe copy the implementation of StandardScaler from 1.1 and use it in
v1.0.x. -Xiangrui
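If copying the class over is not an option, a manual sketch of the same standardization in 1.0.x might look like this (RowMatrix's column statistics existed in 1.0; the toy data is illustrative):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    import org.apache.spark.mllib.regression.LabeledPoint

    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 10.0)),
      LabeledPoint(0.0, Vectors.dense(2.0, 20.0)),
      LabeledPoint(1.0, Vectors.dense(3.0, 30.0))))

    // Per-feature mean and standard deviation from column statistics.
    val summary = new RowMatrix(data.map(_.features)).computeColumnSummaryStatistics()
    val mean = summary.mean.toArray
    val std = summary.variance.toArray.map(math.sqrt)

    // Rescale each feature to zero mean and (where possible) unit variance.
    val scaled = data.map { p =>
      val f = p.features.toArray
      val s = f.indices.map(i => if (std(i) > 0) (f(i) - mean(i)) / std(i) else 0.0)
      LabeledPoint(p.label, Vectors.dense(s.toArray))
    }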
On Wed, Sep 3, 2014 at 5:10 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
It seems like the next release will add a nice
org.apache.spark.mllib.feature package but what is the recommended way to
Not sure what you were referring to when saying replicated RDD. If you actually
mean RDD, then yes, read the API doc and paper as Tobias mentioned.
If you actually focus on the word replicated, then that is for fault
tolerance, and probably mostly used in the streaming case for receiver-created
Hello, I tried to submit a Spark job to a YARN cluster, and an error occurred
with these messages: [root@saturn00 bin]# ./spark-submit --class SparkHiveJoin
--master yarn-cluster --num-executors 10 --executor-memory 12g --executor-cores
1 spark.jar
Spark assembly has been built with Hive,
Hi Xiangrui,
A side question about MLlib.
It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only
supports L2 regularization; the doc explains it: The L1 regularization by using
L1Updater
+DB David (They implemented OWLQN on Spark today.)
On Sep 3, 2014 7:18 PM, Jiusheng Chen chenjiush...@gmail.com wrote:
Hi Xiangrui,
A side question about MLlib.
It looks like the current LBFGS in MLlib (version 1.0.2 and even v1.1) only
supports L2 regularization; the doc explains it: The L1
Did you follow the exact steps on this page
https://spark.apache.org/docs/1.0.2/running-on-yarn.html ?
Please be sure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the
directory which contains the (client side) configuration files for the
Hadoop cluster.
Guodong
On Thu, Sep 4, 2014 at 10:15
This is an issue with how Scala computes closures. Here, because of the
function blah, it is trying to serialize the whole function that this code
is part of. Can you define the function blah outside the main function? In
fact, you can try putting the function in a serializable object.
object
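A sketch of that suggestion (blah is the name from the thread; the object wrapper and body are placeholder logic):

    // A top-level object keeps the function out of the enclosing class,
    // so the task closure no longer drags in non-serializable members.
    object Helpers extends Serializable {
      def blah(s: String): String = s.toUpperCase  // placeholder logic
    }

    // e.g. rdd.map(Helpers.blah) serializes only Helpers, not the outer class.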
With David's help today, we were able to implement elastic net GLM in
Spark. It's surprisingly easy, and with just some modification to breeze's
OWLQN code, it just works without further investigation.
We did a benchmark, and the coefficients are within 0.5% difference compared
with R's glmnet
Dear all:
Spark uses memory to cache RDDs, and the memory size is specified by
spark.storage.memoryFraction.
Once the Executor starts, does Spark support adjusting/resizing the memory
size of this part dynamically?
Thanks.
--
Regards,
Zhaojie
I tried to connect to memcached in a map with the xmemcached lib, and failed:
net.rubyeye.xmemcached.exception.MemcachedException: There is no available
connection at this moment
Has anybody succeeded in using memcached?
Thanks, Shivaram.
No specific use case yet. We try to use R in our project as our data scientists
all know R. We had a concern about how R handles mass data. Spark does better
work in the big data area, and Spark ML is focusing on the predictive analysis
area. So we are thinking about whether we
AFAIK, No.
Best Regards,
Raymond Liu
From: 牛兆捷 [mailto:nzjem...@gmail.com]
Sent: Thursday, September 04, 2014 11:30 AM
To: user@spark.apache.org
Subject: resize memory size for caching RDD
Dear all:
Spark uses memory to cache RDDs, and the memory size is specified by
Thanks DB and Xiangrui. Glad to know you guys are actively working on it.
Another thing: did we evaluate the loss of using Float to store values?
Currently it is Double. Using fewer bits has the benefit of memory footprint
reduction. According to Google, they even use 16 bits (a special encoding
Thanks for your help.
It works after setting SPARK_HISTORY_OPTS.
Zhanfeng Huo
From: Andrew Or
Date: 2014-09-04 07:52
To: Marcelo Vanzin
CC: Zhanfeng Huo; user
Subject: Re: How can I start history-server with kerberos HDFS ?
Hi Zhanfeng,
You will need to set these through SPARK_HISTORY_OPTS
Thank you Raymond and Tobias.
Yeah, I am very clear about what I was asking. I was talking about
replicated RDDs only. Now that I've got my understanding of jobs and
applications validated, I wanted to know if we can replicate an RDD and run
two jobs (that need the same RDD) of an application in
Changing this is not supported; it is immutable, similar to other Spark
configuration settings.
On Wed, Sep 3, 2014 at 8:13 PM, 牛兆捷 nzjem...@gmail.com wrote:
Dear all:
Spark uses memory to cache RDD and the memory size is specified by
spark.storage.memoryFraction.
Once the Executor starts,
When I start the thrift server (on Spark 1.1 RC4) via:
./sbin/start-thriftserver.sh --master spark://hostname:7077 --driver-class-path
$CLASSPATH
It appears that the thrift server is starting off of localhost as opposed to
hostname. I have set the spark-env.sh to use the hostname, modified the