Hi,
I am trying to upgrade from Spark v0.9.1 to v1.0.0 and am running into some weird
behavior.
When, in PySpark, I invoke
sc.textFile("hdfs://hadoop-ha01:/user/x/events_2.1").take(1), the
call crashes with the stack trace below.
The file resides in Hadoop 2.2; it is a large event dataset,
Hello Xiangrui,
I am looking at the Spark issues, but just wanted to know if it is
mandatory for me to work on existing JIRAs before I can contribute to MLlib.
Regards,
Jayati
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar
with the modification you mentioned since the paper is new. We may
need to spend more time learning the trade-offs. Feel free to create a
JIRA for PLSA and we can move our discussion there. It would be great
if you can share
Yes, I need to call the external service for every event, and the order does
not matter.
There's no time limit within which each event must be processed. I can't
tell the producer to slow down, nor drop events.
Of course I could put a message broker in between like an AMQP or JMS
broker but I was
Hi Sparkers,
We have a Storm cluster and are looking for a decent execution engine for
machine-learned models. What I've seen from MLlib is extremely positive,
but we can't just throw away our Storm-based stack.
So my question is: is it feasible/recommended to train models in
Spark/MLlib and execute
Hello Flavio,
It sounds to me like the best solution for you is to implement your own
ReceiverInputDStream/Receiver component to feed Spark Streaming with
DStreams. It is not as scary as it sounds; take a look at some of the
examples, like TwitterInputDStream
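(To make that concrete, here is a minimal sketch of a custom receiver against the
Spark 1.0 Receiver API; the external-service call is a hypothetical placeholder,
not a real endpoint:)

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A receiver that pulls records from some external service on its own thread.
class ExternalServiceReceiver
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Poll on a background thread so onStart() returns quickly.
    new Thread("external-service-poller") {
      override def run(): Unit = {
        while (!isStopped()) {
          val record = fetchFromService()  // hypothetical call to your service
          store(record)                    // hands the record to Spark Streaming
        }
      }
    }.start()
  }

  def onStop(): Unit = {}  // the loop above exits once isStopped() is true

  private def fetchFromService(): String = ???  // placeholder
}

You would then wire it in with ssc.receiverStream(new ExternalServiceReceiver),
which gives you an ordinary DStream[String].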
Hi Michael,
thanks for the tip, it's really an elegant solution.
What I'm still missing here (maybe I should take a look at the code of
TwitterInputDStream,
https://github.com/apache/spark/blob/master/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala)
is
Hi Flavio,
When your streaming job starts, somewhere in the cluster the Receiver will
be started in its own thread/process. You can do whatever you like within
the receiver, e.g. start and manage your own thread pool to fetch external
data and feed Spark. If your Receiver dies, Spark will
Xiangrui and Debasish,
(2014/06/18 6:33), Debasish Das wrote:
I ran a pretty big sparse dataset (20M rows, 3M sparse features) and
got 100 iterations of SGD running in 200 seconds... 10 executors, each
with 16 GB memory...
I was able to figure out what the problem is: spark.akka.frameSize was too
Ok, I'll try to start from that when I try to implement it.
Thanks again for the great support!
Best,
Flavio
On Thu, Jun 19, 2014 at 10:57 AM, Michael Cutler mich...@tumra.com wrote:
Hi Flavio,
When your streaming job starts, somewhere in the cluster the Receiver will
be started in its
Hi,
I am trying to use the new Spark history server in 1.0.0 to view finished
applications (launched on YARN), without success so far.
Here are the relevant configuration properties in my spark-defaults.conf:
spark.yarn.historyServer.address=server_name:18080
spark.ui.killEnabled=false
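(A note for anyone hitting the same wall: the history server can only show
applications that wrote event logs, so properties like the following also need
to be in place; the directory below is purely illustrative:)

spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///user/spark/eventlogs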
I have a JavaPairDStream whose RDD looks like (hello, world).
I want it to join with a JavaPairRDD which has one item, (hello,
spark).
I expect the joined result to be something like (hello, (world,
spark)).
However, I see the result to be (hello, (world, world)).
Is it a bug? Any suggestions?
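(For anyone reading along, the usual way to join a stream against a static pair
RDD is via transform; a minimal Scala sketch, assuming stream:
DStream[(String, String)] and staticRdd: RDD[(String, String)] as above. The
Java transformToPair route is analogous:)

import org.apache.spark.SparkContext._  // pair-RDD implicits

val joined = stream.transform(rdd => rdd.join(staticRdd))
// a matching key yields ("hello", ("world", "spark")) per batch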
While I can't definitively speak to MLlib online learning,
I'm sure you're evaluating Vowpal Wabbit, for which there have been some Storm
integrations contributed.
Also, you might look at factorie, http://factorie.cs.umass.edu,
which at least provides an online LDA.
C
On Thursday, June 19,
Well, yes, VW is an appealing option, but I have only found experimental
integrations so far.
Also, early experiments suggest decision-tree ensembles (RF, GBT) perform
better than generalized linear models on our data. Hence the interest in
MLlib :)
Any other comments / suggestions welcome :)
E/
Hi,
Can someone please clarify a small query on the Graph.pregel operator? As per
the documentation on the merge-message function, only two inbound messages can
be merged to a single value. Is that actually the case, and if so, how can one
merge n inbound messages?
Any help is truly appreciated.
Many Thanks,
I can't speak for MLlib either, but I can say the model of training in Hadoop
M/R or Spark and production scoring in Storm works very well. My team has
done online learning (the Sofia-ML library, I think) in Storm as well.
I would be interested in this answer as well.
-Suren
On Thu, Jun 19, 2014 at
I am trying to run Spark on YARN. I have a Hadoop 2.2 cluster (YARN +
HDFS) in EC2. I compiled Spark using Maven with the Hadoop 2.2 profiles.
Now I am trying to run the example Spark job (in yarn-cluster mode)
from my *local machine*. I have set up the HADOOP_CONF_DIR environment variable
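(For reference, a yarn-cluster launch in Spark 1.0 looks roughly like this; the
jar name, class, and sizes are illustrative:)

./bin/spark-submit --master yarn-cluster \
  --class org.apache.spark.examples.SparkPi \
  --num-executors 4 --executor-memory 2g \
  lib/spark-examples-1.0.0-hadoop2.2.0.jar 10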
On Thu, Jun 19, 2014 at 3:03 PM, redocpot julien19890...@gmail.com wrote:
We did some sanity checks. For example, each user has his own item list, which
is sorted by preference, and we just pick the top 10 items for each user.
As a result, we found that there were only 169 different items among
Still struggling with SPARK_JAVA_OPTS being deprecated. I am using Spark
standalone.
For example, if I have an Akka timeout setting that I would like to be
applied to every piece of the Spark framework (the Spark master, Spark
workers, Spark executor sub-processes, spark-shell, etc.), I used to do
It is because the frame size is not set correctly in the executor backend; see
SPARK-1112. We are going to fix it in v1.0.1. Did you try treeAggregate?
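(Until then, a possible workaround, offered as an assumption rather than a
confirmed fix, is to raise the frame size explicitly when building the context:)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("SGDJob")
  .set("spark.akka.frameSize", "128")  // in MB; the default is 10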
On Jun 19, 2014, at 2:01 AM, Makoto Yui yuin...@gmail.com wrote:
Xiangrui and Debasish,
(2014/06/18 6:33), Debasish Das wrote:
I did
Here is a partial comparison.
http://dspace.mit.edu/bitstream/handle/1721.1/82517/MIT-CSAIL-TR-2013-028.pdf?sequence=2
SciDB uses MPI with Intel HW and libraries. Amazing performance at the cost
of more work.
In case the link stops working:
A Complex Analytics Genomics Benchmark, Rebecca Taft,
Hi,
I have had this issue for some time already, where I get different answers
when I run the same line of code twice. I have run some experiments to see
what is happening; please help me! Here is the code and the answers that I
get. I suspect I have a problem when reading large datasets from S3.
Larry,
I don't see any reference to Spark in particular there.
Additionally, the benchmark only scales up to datasets that are roughly
10 GB (though I realize they've picked some fairly computationally intensive
tasks), and they don't present their results on more than 4 nodes. This can
hide
Xiangrui,
(2014/06/19 23:43), Xiangrui Meng wrote:
It is because the frame size is not set correctly in the executor backend; see
SPARK-1112. We are going to fix it in v1.0.1. Did you try treeAggregate?
Not yet. I will wait for the v1.0.1 release.
Thanks,
Makoto
On Thu, Jun 19, 2014 at 3:44 PM, redocpot julien19890...@gmail.com wrote:
As the paper says, low ratings get a low confidence weight, so if I
understand correctly, these dominant one-timers will be more *unlikely* to
be recommended compared to other items whose nbPurchase is bigger.
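(For concreteness, the implicit-feedback variant in MLlib, where confidence
grows with the observed count, c = 1 + alpha * r in the Hu/Koren/Volinsky
paper; the parameter values below are purely illustrative:)

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// ratings: RDD[Rating] built from (user, item, nbPurchase) triples
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 40.0)
// rank = 10, iterations = 10, lambda = 0.01, alpha = 40.0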
Hi all,
Is there any plan for a 1.0.1 release?
Mingyu
That's quite odd. Yes, with checkpointing the lineage does not increase. Can
you tell which stage of the processing of each batch is causing the
increase in the processing time?
Also, what are the batch interval and checkpoint interval?
TD
On Thu, Jun 19, 2014 at 8:45 AM, Skogberg, Fredrik
Hi All,
I was curious to know which of the two approaches is better for doing
analytics using Spark Streaming. Let's say we want to add some metadata to
the stream being processed, like sentiment, tags, etc., and then
perform some analytics using this added metadata.
1) Is it ok to make a
I've created an issue for this but if anyone has any advice, please let me
know.
Basically, on about 10 GBs of data, saveAsTextFile() to HDFS hangs on two
remaining tasks (out of 320). Those tasks seem to be waiting on data from
another task on another node. Eventually (about 2 hours later) they
I'm working on a problem of learning several different sets of responses
against the same set of training features. Right now I've written the
program to cycle through all of the different label sets, attach them to
the training data, and run LogisticRegressionWithSGD on each of them, i.e.
foreach
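(That loop might look roughly like this in Scala; labelSets and features are
assumed names, not anything from the original post:)

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// labelSets: Seq[RDD[Double]], features: RDD[Vector]
// note: zip assumes both RDDs have identical partitioning and counts
val models = labelSets.map { labels =>
  val training = labels.zip(features).map {
    case (label, fv) => LabeledPoint(label, fv)
  }
  LogisticRegressionWithSGD.train(training, 100)  // 100 iterations
}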
Yarn-client mode is much like Spark's standalone client mode, except that the
executors run in YARN containers managed by the YARN
resource manager on the cluster, instead of as Spark workers managed
by the Spark master. The driver executes as a local client in your
local JVM. It
zeroTime marks the time when the streaming job started, and the first batch
of data is from zeroTime to zeroTime + slideDuration. The validity check of
(time - zeroTime) being a multiple of slideDuration is to ensure that a
given DStream generates RDDs at the right times. For example, say the
You should be able to use many of the MLlib Model objects directly in Storm, if
you save them out using Java serialization. The only one that won’t work is
probably ALS, because it’s a distributed model.
Otherwise, you will have to output them in your own format and write code for
evaluating
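(A minimal sketch of the Java-serialization route; the model variable and file
path are assumptions:)

import java.io.{FileOutputStream, ObjectOutputStream}

// Save a trained model, e.g. a LogisticRegressionModel, for a Storm bolt
val out = new ObjectOutputStream(new FileOutputStream("model.bin"))
out.writeObject(model)
out.close()
// The Storm side deserializes with ObjectInputStream and calls predict().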
We are submitting the Spark job from our Tomcat application using
yarn-cluster mode with great success. As Kevin said, yarn-client mode
runs the driver in your local JVM, and it will have really bad network
overhead when one does a reduce action, which pulls all the results from the
executors to your local
I want to ask this, not because I can't read endless documentation and
several tutorials, but because there seem to be many ways of doing things,
and I keep having issues. How do you run *your* Spark app?
I had it working when I was only using YARN + Hadoop 1 (Cloudera); then I had
to get Spark and
Based on Jacob's suggestion, I started using --net=host, which is a new
option in the latest version of Docker. I also set SPARK_LOCAL_IP to the host's
IP address, and then Akka does not use the hostname and I don't need the
Spark driver's hostname to be resolvable.
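(For later readers, the invocation described would look something like this;
the image name and IP are hypothetical:)

docker run --net=host -e SPARK_LOCAL_IP=10.0.0.5 my-spark-image \
  /opt/spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://master:7077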
Thanks guys for your help!
On Tue,
I use SBT, create an assembly, and then add the assembly jars when I create
my spark context. The main executor I run with something like java -cp ...
MyDriver.
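(That pattern, sketched; the jar path is illustrative:)

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyDriver")
  .setJars(Seq("target/scala-2.10/myapp-assembly-0.1.jar"))
val sc = new SparkContext(conf)  // executors fetch the listed jars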
That said, as of Spark 1.0 the preferred way to run Spark applications is
via spark-submit -
If I'm understanding correctly, you want to use MLlib for offline training
and then deploy the learned model to Storm? In this case I don't think
there is any problem. However, if you are looking for online model
update/training, this can be complicated, and I guess quite a few algorithms
in MLlib
I have a similar case where I have an RDD[(List[Any], List[Long])] and want to
save it as a Parquet file.
My understanding is that only RDDs of case classes can be converted to
SchemaRDD. So is there any way I can save this RDD as a Parquet file without
using Avro?
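(For anyone reading along, the case-class route in Spark 1.0 SQL looks roughly
like this; the Record shape and the flattening step are guesses at the data,
not a drop-in answer:)

import org.apache.spark.sql.SQLContext

case class Record(key: String, value: Long)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD

// rdd: RDD[(List[Any], List[Long])], flattened into typed rows
val schemaRdd = rdd.flatMap { case (keys, values) =>
  keys.zip(values).map { case (k, v) => Record(k.toString, v) }
}
schemaRdd.saveAsParquetFile("records.parquet")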
Thanks in advance
Anita
On 18 June
Coincidentally, I just ran into the same exception. What's probably
happening is that you're specifying some jar file in your job as an
absolute local path (e.g. just
/home/koert/test-assembly-0.1-SNAPSHOT.jar), but your Hadoop config
has the default FS set to HDFS.
So your driver does not know
Gino,
I can confirm that your solution of assembling with spark-streaming-kafka
but excluding spark-core and spark-streaming has me working with basic
spark-submit. As mentioned, you must specify the assembly jar in the
SparkConf as well as to spark-submit.
When I see the error you are now
DB Tsai,
If in yarn-cluster mode the driver runs inside YARN, how can you do an
rdd.collect and bring the results back to your application?
On Thu, Jun 19, 2014 at 2:33 PM, DB Tsai dbt...@stanford.edu wrote:
We are submitting the spark job in our tomcat application using
yarn-cluster mode with
I'll make a comment on the JIRA - thanks for reporting this, let's get
to the bottom of it.
On Thu, Jun 19, 2014 at 11:19 AM, Surendranauth Hiraman
suren.hira...@velos.io wrote:
I've created an issue for this but if anyone has any advice, please let me
know.
Basically, on about 10 GBs of
Hello all,
Apologies for the late response; this thread went under my radar. There are
a number of things that can be done to improve the performance. Here are
some of them off the top of my head, based on what you have mentioned. Most
of them are mentioned in the streaming guide's performance
Currently, we save the result in HDFS and read it back in our
application. Since Client.run is a blocking call, it's easy to do it
this way.
We are now working on using Akka to bring the result back to the app
without going through HDFS, and we can use Akka to track the log
and stack trace,
I'm trying to write a JavaPairRDD to a downstream database using
saveAsNewAPIHadoopFile with a custom OutputFormat and the process is pretty
slow.
Is there a way to boost the concurrency of the save process? For example,
something like splitting the RDD into multiple smaller RDDs and using Java
Okay. Since for us the main purpose is to retrieve (small) data, I guess I
will stick to yarn-client mode. Thx
On Thu, Jun 19, 2014 at 3:19 PM, DB Tsai dbt...@stanford.edu wrote:
Currently, we save the result in HDFS and read it back in our
application. Since Client.run is a blocking call, it's
Hi,
I'm searching for some information about running programs concurrently on Spark.
I did a simple experiment on the Spark master: I opened two terminals, then
submitted two programs to the master at the same time and watched their status
via the Spark master web UI. I found one program could
Hi All,
I am attempting to develop some unit tests for a program using PySpark and
scikit-learn, and I've come across some weird behavior. I receive the
following warning during some tests: python/pyspark/serializers.py:327:
DeprecationWarning: integer argument expected, got float.
Although it's
Hi Koert and Lukasz,
The recommended way of not hard-coding configurations in your application
is through conf/spark-defaults.conf as documented here:
http://spark.apache.org/docs/latest/configuration.html#dynamically-loading-spark-properties.
However, this is only applicable to
spark-submit, so
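(For reference, a minimal sketch of such a file; the values are purely
illustrative:)

spark.master            spark://master:7077
spark.executor.memory   4g
spark.akka.frameSize    128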
Hi Praveen,
Yes, the fact that it is trying to use a private IP from outside of the
cluster is suspicious.
My guess is that your HDFS is configured to use internal IPs rather than
external IPs.
This means even though the hadoop confs on your local machine only use
external IPs,
the
(Also, an easier workaround is to simply submit the application from within
your
cluster, thus saving you all the manual labor of reconfiguring everything
to use
public hostnames. This may or may not be applicable to your use case.)
2014-06-19 14:04 GMT-07:00 Andrew Or and...@databricks.com:
So..
Here is my experimental code to get a feel for it:

def read_file(filename):
    with open(filename) as f:
        lines = [line for line in f]
    return lines

files = ["/somepath/.../test1.txt", "sompath/.../test2.txt"]

test1.txt has:
foo bar
this is test1
test2.txt has:
bar foo
this is text2
The main things that will affect the concurrency of any saveAs...()
operations are (a) the number of partitions of your RDD, and (b) how many
cores your cluster has.
How big is the RDD in question? How many partitions does it have?
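(If the partition count turns out to be small, the usual remedy looks like
this; 200 is arbitrary and the Hadoop classes are placeholders:)

// more partitions -> more concurrent write tasks, up to the core count
rdd.repartition(200)
   .saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass)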
On Thu, Jun 19, 2014 at 3:38 PM, Sandeep Parikh
Hi Justin,
I am glad to know that trees are working well for you.
The trees will support a minimum-samples-per-node setting in a future release. Thanks
for the feedback.
-Manish
On Fri, Jun 13, 2014 at 8:55 PM, Justin Yip yipjus...@gmail.com wrote:
Hello,
I have been playing around with mllib's
When you start seriously using Spark in production, there are basically two
things everyone eventually needs:
1. Scheduled jobs - recurring hourly/daily/weekly jobs.
2. Always-on jobs - jobs that require monitoring, restarting, etc.
There are lots of ways to implement these requirements,
P.S. Last but not least, we use sbt-assembly to build fat JARs and build
dist-style tar.gz packages with launch scripts, JARs, and everything needed
to run a job. These are automatically built from source by our Jenkins and
stored in HDFS. Our Chronos/Marathon jobs fetch the latest release
Hi Justin,
I have created a JIRA ticket to keep track of your request. Thanks.
https://issues.apache.org/jira/browse/SPARK-2207
-Manish
On Thu, Jun 19, 2014 at 2:35 PM, Manish Amde manish...@gmail.com wrote:
Hi Justin,
I am glad to know that trees are working well for you.
The trees will
Hi TD,
That's quite odd. Yes, with checkpointing the lineage does not increase. Can you
tell which stage of the processing of each batch is causing the increase in
the processing time?
I haven't been able to determine exactly which stage is causing the
increase in processing time. Any
For a JVM application it's not very appealing to me to use spark-submit.
My application uses Hadoop, so I should use hadoop jar; and my
application uses Spark, so it should use spark-submit. If I add a piece
of code that uses some other system, there will be yet another suggested way
to launch
I'm starting to develop ADMM for some models using PySpark (Spark version
1.0.0), so I constantly simulate data to test my code. I did the simulation in
Python, but then I ran into the same kind of problems as mentioned above. The
same meaningless error messages show up when I try methods like first,
Hello Andrew,
I wish I could share the code, but for proprietary reasons I can't. But I
can give some idea of what I am trying to do. The job reads a file
and processes each of its lines. I am not doing
anything intense in the processLogs function
import
Many merge operations can be broken up to work incrementally. For example,
if the merge operation is to sum *n* rank updates, then you can set mergeMsg
= (a, b) => a + b and this function will be applied to all *n* updates in
arbitrary order to yield a final sum. Addition, multiplication, min, max,
Master/worker assignment and barrier synchronization are handled by Spark,
so Pregel will automatically coordinate the computation, communication, and
synchronization needed to run for the specified number of supersteps (or
until there are no messages sent in a particular superstep).
Ankur
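(Putting those two points together, a minimal Pregel sketch with a commutative,
associative mergeMsg, assuming a Graph[Double, Double] named graph:)

import org.apache.spark.graphx._

val result = graph.pregel(0.0, maxIterations = 10)(
  (id, attr, msg) => attr + msg,                 // vprog: apply the merged message
  t => Iterator((t.dstId, t.srcAttr * t.attr)),  // sendMsg
  (a, b) => a + b                                // mergeMsg: folds n messages pairwise
)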
Hi Stuti,
Yes, you do need to install R on all nodes. Furthermore, the rJava
library is also required, which can be installed simply by running
install.packages("rJava") in the R shell. Some more installation
instructions after that step can be found in the README here:
I want to store a JavaRDD as a sequence file instead of a text file, but I don't
see any Java API for that. Is there a way to do this? Please let me know.
Thanks!
1. Once you have generated the final RDD, before submitting it to the reducer, try to
repartition it into a known number of partitions using either coalesce(partitions)
or repartition().
2. Rule of thumb for the number of data partitions: 3 * num_executors * cores_per_executor.
Unfortunately, I couldn’t figure it out without involving Avro.
Here is something that may be useful, since it uses Avro generic records (so no
case classes needed) and transforms to Parquet:
http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
HTH,
Mahesh
Can you use saveAsObjectFile?
On Thu, Jun 19, 2014 at 5:54 PM, abhiguruvayya sharath.abhis...@gmail.com
wrote:
I want to store a JavaRDD as a sequence file instead of a text file, but I don't
see any Java API for that. Is there a way to do this? Please let me know.
Thanks!
No. My understanding from reading the code is that RDD.saveAsObjectFile uses
Java serialization and RDD.saveAsSequenceFile uses Writable, which is tied to
the Writable serialization framework in Hadoop.
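(Both routes from Scala, for comparison; the paths are illustrative, and the
implicits import matters for saveAsSequenceFile. Assuming rdd: RDD[String]:)

import org.apache.spark.SparkContext._

rdd.saveAsObjectFile("hdfs:///out/objects")  // Java serialization under the hood
rdd.map(line => (line, 1)).saveAsSequenceFile("hdfs:///out/seq")
// String and Int are converted to Text and IntWritable via implicits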
--
View this message in context:
Thanks Mahesh,
I came across this example; it looks like it might give us some direction.
https://github.com/Parquet/parquet-mr/tree/master/parquet-hadoop/src/main/java/parquet/hadoop/example
Thanks
Anita
On 20 June 2014 09:03, maheshtwc mahesh.padmanab...@twc-contractor.com
wrote:
I'm not sure if this is a Hadoop-centric issue or not. I had similar issues
with non-serializable external-library classes.
I used a Kryo config (as illustrated here:
https://spark.apache.org/docs/latest/tuning.html#data-serialization) and
registered the one troublesome class. It seemed to work.
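(The registration route described, sketched; the class names are hypothetical
stand-ins:)

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class ExternalLibraryClass  // stand-in for the real troublesome class

class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[ExternalLibraryClass])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator")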
I'm trying to use Spark (Java) for an optimization algorithm that needs
repeated server-node exchanges of information (the ADMM algorithm, for
whoever is familiar). In each iteration, I need to update a set of values on
the nodes and collect them on the server, which will update its own set of
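(The iterate/broadcast/collect shape of that, sketched in Scala; every name
here is an assumption, since MLlib has no built-in ADMM:)

// data: RDD of per-partition problem blocks; the update functions are yours
var global: Array[Double] = initialGlobal
for (iter <- 1 to numIterations) {
  val bc = sc.broadcast(global)                   // ship globals to the nodes
  val locals = data.map(b => localUpdate(b, bc.value)).collect()
  global = serverUpdate(locals)                   // driver-side consensus step
}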
We use Maven for building our code and then invoke spark-submit through the
exec plugin, passing in our parameters. It works well for us.
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Jun 20, 2014 at 3:26 AM, Michael Cutler