Hi,
there are apparently helpers to tell you the offsets
https://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example#id-0.8.0SimpleConsumerExample-FindingStartingOffsetforReads,
but I have no idea how to pass that to the Kafka stream consumer. I am
interested in that as well.
Hi guys,
We are choosing between C++ MPI and Spark. Is there an official
comparison between them? Thanks a lot!
Wei
We are creating a real-time stream processing system with Spark Streaming
which applies a large number (millions) of analytic models to RDDs in
many different types of streams. Since we do not know which Spark node
will process specific RDDs, we need to make these models available at each node.
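One common pattern for this (just a sketch, not from the original thread; Model, loadModels, events and the record fields are all hypothetical names, and it assumes the models fit in executor memory) is to broadcast the models once and look the right one up per record:

  val models: Map[String, Model] = loadModels()   // hypothetical: build or load the model lookup table on the driver
  val modelsBc = sc.broadcast(models)             // shipped once to every executor instead of once per task

  val scored = events.map { e =>
    val model = modelsBc.value(e.modelId)         // pick the model this record needs
    model.score(e)
  }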
My transformations and actions have some external toolset dependencies, and
sometimes they just get stuck somewhere with no way for me to fix them. If I
don't want the job to run forever, do I need to implement monitor
threads that throw an exception when they get stuck, or can the framework
Hi, I have the same exception. Can you tell me how did you fix it? Thank you!
Hello Wei,
I speak from experience of writing many HPC distributed applications using
Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel
Virtual Machine (PVM) way before that, back in the 90's. I can say with
absolute certainty:
*Any gains you believe there are because C++ is
Hi,
I'm trying to use Accumulo with Spark by writing to AccumuloOutputFormat.
It all went well on my laptop (Accumulo MockInstance + Spark local mode),
but when I submit it to the YARN cluster, the YARN logs show the
following error message:
14/06/16 02:01:44 INFO
Hi
Check your driver program's Environment page (e.g.
http://192.168.1.39:4040/environment/). If you don't see the
commons-codec-1.7.jar there, then that's the issue.
Thanks
Best Regards
On Mon, Jun 16, 2014 at 5:07 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I'm trying to use Accumulo
If you don't want to refactor your code, you can put your input into a test
file. After the test runs, read the data from the output file you specified
(you probably want this to be a temp file deleted on exit). Of course, that
is not really a unit test - Matei's suggestion is preferable (this is
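A rough sketch of that file-based arrangement, assuming a hypothetical runJob(inputPath, outputPath) that wraps the code under test and that the job writes its result with saveAsTextFile:

  import java.nio.file.Files
  import scala.io.Source

  val inputFile = Files.createTempFile("job-input", ".txt").toFile
  inputFile.deleteOnExit()
  Files.write(inputFile.toPath, "line1\nline2\n".getBytes("UTF-8"))

  // saveAsTextFile refuses to overwrite, so hand the job a directory that does not exist yet.
  val outputDir = Files.createTempDirectory("job-output").toFile
  outputDir.deleteOnExit()
  val outputPath = new java.io.File(outputDir, "result").getAbsolutePath

  runJob(inputFile.getAbsolutePath, outputPath)   // hypothetical job under test

  // The job writes part-* files under the output directory; read them back for the check.
  val produced = new java.io.File(outputPath).listFiles.toSeq
    .filter(_.getName.startsWith("part-"))
    .flatMap(f => Source.fromFile(f).getLines())
    .toSet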
Thank you all for the advice, including (1) using the CMS GC, (2) using
multiple worker instances, and (3) using Tachyon.
I will try (1) and (2) first and report back what I found.
I will also try JDK 7 with G1 GC.
Best regards,
Wei
-
Wei Tan, PhD
Research Staff Member
IBM T.
BTW: nowadays a single machine with huge RAM (200 GB to 1 TB) is really
common. With virtualization you lose some performance. It would be ideal
to see some best practices on how to use Spark on these state-of-the-art
machines...
Best regards,
Wei
-
Wei Tan, PhD
Hi all,
I am testing the regression methods (SGD) using PySpark. I tried to tune the
parameters, but the results are far off from those obtained using R. Is there
some way to set these parameters more effectively?
thanks,
Forgot to mention that I'm running Spark 1.0.
Hi,
I've been doing some testing with Calliope as a way to do batch load from
Spark into Cassandra.
My initial results are promising on the performance front, but worrisome on
the memory footprint side.
I'm generating N records of about 50 bytes each and using the UPDATE
mutator to insert them
Hi Xiangrui,
Thank you for the reply! I have tried customizing
LogisticRegressionSGD.optimizer as in the example you mentioned, but the source
code reveals that the intercept is also penalized if one is included, which is
usually inappropriate. The developer should fix this problem.
Best,
Like many people, I'm trying to do hourly counts. The twist is that I don't
want to count per hour of streaming, but per hour of the actual occurrence
of the event (wall clock, say yyyy-MM-dd HH).
My thought is to make the streaming window large enough that a full hour of
streaming data would fit
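A rough sketch of that approach (not from the original thread), assuming each event carries an epoch-millisecond timestamp, `events` is a DStream[(Long, String)] of (timestamp, payload) pairs, and the batch interval is 10 seconds:

  import org.apache.spark.streaming.Seconds
  import org.apache.spark.streaming.StreamingContext._  // pair DStream operations on older Spark versions

  val hourlyCounts = events
    .map { case (ts, _) =>
      // Key each event by the wall-clock hour it occurred in, not the hour it arrived.
      val fmt = new java.text.SimpleDateFormat("yyyy-MM-dd HH")
      (fmt.format(new java.util.Date(ts)), 1L)
    }
    // Window wide enough to hold a full hour of (possibly late) data, sliding every batch.
    .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Seconds(3 * 3600), Seconds(10))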
Hey
I am new to Spark Streaming and apologize if these questions have already been
asked.
* In StreamingContext, reduceByKey() seems to only work on the RDDs of the
current batch interval, not including RDDs of previous batches. Is my
understanding correct?
* If the above statement is correct, what
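For the first question: yes, reduceByKey on a DStream only operates on the RDDs of the current batch interval. A minimal sketch of keeping a running count across batches with updateStateByKey, assuming an existing StreamingContext `ssc`, a DStream[(String, Int)] named `pairs`, and a checkpoint path that is only a placeholder:

  import org.apache.spark.streaming.StreamingContext._  // pair DStream operations on older Spark versions

  // Stateful operations require a checkpoint directory.
  ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")

  val runningCounts = pairs.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
    // Called once per key per batch: fold this batch's values into the running total.
    Some(state.getOrElse(0) + newValues.sum)
  }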
I'm playing with a modified version of the TwitterPopularTags example, and
when I try to submit the job to my cluster, workers keep dying with this
message:
14/06/16 17:11:16 INFO DriverRunner: Launch Command: java -cp
Is your data normalized? Sometimes GD doesn't work well if the data
has a wide range. If you are willing to write Scala code, you can try
the LBFGS optimizer, which converges better than GD.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
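For reference, a sketch along the lines of the MLlib L-BFGS example discussed in this thread, assuming an existing SparkContext `sc`; the LIBSVM file path and the hyperparameters are placeholders. Note that with SquaredL2Updater the appended bias term gets regularized as well, which is the limitation raised later in the thread:

  import org.apache.spark.mllib.classification.LogisticRegressionModel
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
  import org.apache.spark.mllib.util.MLUtils

  val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
  val numFeatures = data.take(1)(0).features.size

  // Append a constant 1.0 to every feature vector so the intercept is learned as an extra weight.
  val training = data.map(p => (p.label, Vectors.dense(p.features.toArray :+ 1.0))).cache()

  val (weightsWithIntercept, lossHistory) = LBFGS.runLBFGS(
    training,
    new LogisticGradient(),
    new SquaredL2Updater(),
    10,      // numCorrections
    1e-4,    // convergenceTol
    20,      // maxNumIterations
    0.1,     // regParam
    Vectors.dense(new Array[Double](numFeatures + 1)))

  // Split the learned vector back into weights and intercept.
  val model = new LogisticRegressionModel(
    Vectors.dense(weightsWithIntercept.toArray.dropRight(1)),
    weightsWithIntercept(numFeatures))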
Hello Spark Streaming Experts
I have a use case where I have a bunch of log entries coming in, say every
10 seconds (the batch interval). I create a JavaPairDStream[K,V] from these
log entries. Now, there are two things I want to do with this
JavaPairDStream:
1. Use key-dependent state (updated by
It’s true that it can’t. You can try to use the CloudPickle library instead,
which is what we use within PySpark to serialize functions (see
python/pyspark/cloudpickle.py). However, I'm also curious: why do you need an
RDD of functions?
Matei
On Jun 15, 2014, at 4:49 PM, madeleine
Someone is working on weighted regularization. Stay tuned. -Xiangrui
On Mon, Jun 16, 2014 at 9:36 AM, FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
fixed-term.congrui...@us.bosch.com wrote:
Hi Xiangrui,
Thank you for the reply! I have tried customizing
LogisticRegressionSGD.optimizer as in the
Hi Congrui,
We're working on weighted regularization, so for intercept, you can
just set it as 0. It's also useful when the data is normalized but
want to solve the regularization with original data.
Sincerely,
DB Tsai
---
My Blog:
Thank you! I'm really looking forward to that.
Best,
Congrui
-----Original Message-----
From: Xiangrui Meng [mailto:men...@gmail.com]
Sent: Monday, June 16, 2014 11:19 AM
To: user@spark.apache.org
Subject: Re: MLlib-Missing Regularization Parameter and Intercept for Logistic
Regression
Hi Congrui,
I mean create your own TrainMLOR.scala with all the code provided in
the example, and have it under package org.apache.spark.mllib
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
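A minimal skeleton of what DB describes (just a sketch; the body is meant to hold the example code referenced above):

  // Placing the file under this package is the point of the suggestion: it gains
  // access to members that are private[mllib] and not reachable from user packages.
  package org.apache.spark.mllib

  import org.apache.spark.{SparkConf, SparkContext}

  object TrainMLOR {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("TrainMLOR"))
      // ... paste the L-BFGS example code here ...
      sc.stop()
    }
  }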
Hi DB,
Thank you for the reply! I'm looking forward to this change, which will surely
add much more flexibility to the optimizers, including whether or not the
intercept should be penalized.
Sincerely,
Congrui Yi
From: DB Tsai-2 [via Apache Spark User List]
I guess you have to understand the difference in architecture. I don't know
much about C++ MPI, but it is basically MPI, whereas Spark is inspired by
Hadoop MapReduce and optimised for reading/writing large amounts of data
with a smart caching and locality strategy. Intuitively, if you have a high
Thank you! I'll try it out.
From: DB Tsai-2 [via Apache Spark User List]
[mailto:ml-node+s1001560n7686...@n3.nabble.com]
Sent: Monday, June 16, 2014 11:32 AM
To: FIXED-TERM Yi Congrui (CR/RTC1.3-NA)
Subject: Re: MLlib-a problem of example code for L-BFGS
Hi Congrui,
I mean create your own
Did you manage to make it work? I'm facing similar problems and this is a
serious blocker issue. spark-submit seems kind of broken to me if you can't
use it for Spark Streaming.
Regards,
Luis
2014-06-11 1:48 GMT+01:00 lannyripple lanny.rip...@gmail.com:
I am using Spark 1.0.0 compiled with Hadoop
Hi All,
I am just trying to compare the Scala and Python APIs on my local machine. I
tried to import a local matrix (1000 by 10, created in R) stored in a text
file via textFile in PySpark. When I run data.first() it fails to return
the line and gives error messages including the following:
Then I did
Hi everyone,
The Python LogisticRegressionWithSGD does not appear to estimate an
intercept. When I run the following, the returned weights and intercept
are both 0.0:
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD
We are having the same problem. We're running Spark 0.9.1 in standalone
mode, and on some heavy jobs workers become unresponsive and are marked by
the master as dead, even though the worker process is still running. Then they
never rejoin the cluster, and the cluster becomes essentially unusable until
we
Interesting! I'm curious why you use cloudpickle internally, but then use
standard pickle to serialize RDDs?
I'd like to create an RDD of functions because (I think) it's the most
natural way to express my problem. I have a matrix of functions; I'm trying
to find a low rank matrix that minimizes
Hi,
I'm a new Spark user and I'm trying to integrate it into my cluster.
It's a small set of nodes running CDH 4.7 with Kerberos.
The other services are fine with the authentication, but I'm having some
trouble with Spark.
First, I used the parcel available in Cloudera Manager (SPARK
Hi,
I have a Spark method that returns RDD[String], which I am converting to a
set and then comparing to the expected output, as shown in the following
code.
1. val expected_res = Set("ID1", "ID2", "ID3") // expected output
2. val result: RDD[String] = getData(input) // method returns RDD[String]
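For what it's worth, a small sketch of the comparison step being described, assuming the RDD is small enough to collect to the driver:

  val actual_res = result.collect().toSet  // materialize the RDD on the driver
  assert(actual_res == expected_res)       // fails if any element differs, including by embedded quote characters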
Are you referring to accessing a SparkUI for an application that has
finished? First you need to enable event logging while the application is
still running. In Spark 1.0, you set this by adding a line to
$SPARK_HOME/conf/spark-defaults.conf:
spark.eventLog.enabled true
Other than that, the
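The same flag (plus a log directory, shown here purely as an illustration) can also be set programmatically when the SparkContext is built:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.eventLog.enabled", "true")
    // Optional: write event logs somewhere persistent so the UI can be replayed later.
    .set("spark.eventLog.dir", "hdfs:///user/spark/eventlogs")

  val sc = new SparkContext(conf)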
On Mon, Jun 16, 2014 at 10:09 PM, SK skrishna...@gmail.com wrote:
The value returned by the method is almost the same as the expected output, but
the verification is failing. I am not sure why the expected_res in Line 5 does
not print the quotes even though Line 1 has them. Could that be the reason
In Line 1, I have expected_res as a set of strings with quotes, so I thought
it would include the quotes during comparison. Anyway, I modified it to
expected_res = Set("\"ID1\"", "\"ID2\"", "\"ID3\"") and
that seems to work.
thanks.
Ah, I see, interesting. CloudPickle is slower than the cPickle library, so
that’s why we didn’t use it for data, but it should be possible to write a
Serializer that uses it. Another thing you can do for this use case though is
to define a class that represents your functions:
class
With help from the Accumulo guys, I probably know why.
I'm using the binary distro of Spark, and Base64 comes from spark-assembly.jar,
which probably bundles an older version of commons-codec.
I'll need to rebuild Spark from source.
Jianshi
On Mon, Jun 16, 2014 at 9:18 PM, Akhil Das
Hi Folks,
I am having trouble getting the Spark driver running in Docker. If I run a
PySpark example on my Mac it works, but the same example on a Docker image
(via boot2docker) fails with the following logs. I am pointing the Spark driver
(which is running the example) to a Spark cluster (the driver is not
Spark gives you four of the classical collectives: broadcast, reduce,
scatter, and gather. There are also a few additional primitives, mostly
based on a join. Spark is certainly less optimized than MPI for these, but
maybe that isn't such a big deal. Spark has one theoretical disadvantage
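As a toy illustration of how those primitives surface in Spark (a sketch, assuming an existing SparkContext `sc`):

  val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))      // broadcast: read-only data shipped to every executor
  val data   = sc.parallelize(1 to 1000000)               // scatter-like: the collection is partitioned across the cluster
  val total  = data.reduce(_ + _)                         // reduce: aggregate back to a single value on the driver
  val sample = data.filter(_ % 100000 == 0).collect()     // gather-like: pull selected elements back to the driver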
Hi,
my Hive configuration uses DB2 as its metastore database. I have built
Spark with the extra step sbt/sbt assembly/assembly to include the
dependency jars, and copied HIVE_HOME/conf/hive-site.xml under spark/conf.
When I ran:
hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
I got
+1 for this issue. The documentation for spark-submit is misleading. Among many
issues, the jar support is bad: HTTP URLs do not work, because Spark is
using Hadoop's FileSystem class. You have to specify the jars twice to get
things to work: once for the DriverWrapper to load your classes
If you want a string with quotes in it, you have to escape them with '\'. That's
exactly what you did in the modified version.
Sent from my iPhone
On June 17, 2014, at 5:43, SK skrishna...@gmail.com wrote:
In Line 1, I have expected_res as a set of strings with quotes. So I thought
it would include the
Hi all,
I have run into a very interesting bug which is not exactly the same as
SPARK-1112.
Here is how to reproduce it: I have one input CSV file and use the
partitionBy function to create an RDD, say repartitionedRDD. The
partitionBy function takes the number of partitions as a parameter
such
Hi,
are you using the amplab/spark-1.0.0 images from the global registry?
Andre
On 06/17/2014 01:36 AM, Mohit Jaggi wrote:
Hi Folks,
I am having trouble getting spark driver running in docker. If I run a
pyspark example on my mac it works but the same example on a docker image
(Via