Hi,
Can we use Storm Streaming as an RDD in Spark? Or is there any way to get Spark
to work with Storm?
Thanks
Ajay
That's because somewhere in your code you have specified localhost
instead of the IP address of the machine running the service. When you run it in
local mode it will work fine, because everything happens on that machine and
hence it will be able to connect to localhost, which runs the service; now
Hi all: I wonder if there is a way to export data from a Hive table into HDFS
using Spark? Like this: INSERT OVERWRITE DIRECTORY '/user/linqili/tmp/src'
select * from $DB.$tableName
Hi,
I'm not sure what you are asking:
Whether we can use spouts and bolts in Spark (= no)
or whether we can do streaming in Spark:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
-kr, Gerard.
On Tue, Dec 23, 2014 at 9:03 AM, Ajay ajay.ga...@gmail.com wrote:
Hi,
Can
Hi,
The question is to do streaming in Spark with Storm (not using Spark
Streaming).
The idea is to use Spark as an in-memory computation engine, with static data
coming from Cassandra/HBase and streaming data from Storm.
Thanks
Ajay
On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas
(In your libsvm example, your indices are not ascending.)
The first weight corresponds to the first feature, of course. An
indexing scheme doesn't change that or somehow make the first feature
map to the second (where would the last one go then?). You'll find the
first weight at offset 0 in an
Hi,
I have a CSV file with fields a, b, c.
I want to do aggregation (sum, average, ...) based on any field (a, b, or c) as per
user input,
using the Apache Spark Java API. Please help, urgent!
Thanks in advance,
Regards
Sachin
After checking the Spark code, I now realize that an RDD that was cached to
disk can't be evicted, so I will just persist the RDD to disk after the
random numbers are created.
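For reference, a minimal sketch of that persist-to-disk step (standard RDD API; the
RDD contents here are just a placeholder):

import org.apache.spark.storage.StorageLevel

// generate the random numbers once, then pin them to disk so later actions
// don't recompute (and change) the values
val randomNumbers = sc.parallelize(1 to 1000000).map(_ => scala.util.Random.nextDouble())
randomNumbers.persist(StorageLevel.DISK_ONLY)
randomNumbers.count()  // force materialization to disk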
I'm not aware of a project trying to achieve this integration. At some
point Summingbird had the intention of adding a Spark port, and that could
potentially bridge Storm and Spark. Not sure if that evolved into something
concrete.
In any case, an attempt to bring Storm and Spark together will
Right. I contacted the SummingBird users as well. It doesn't support Spark
Streaming currently.
We are heading towards Storm as it is the most widely used. Is Spark
Streaming production ready?
Thanks
Ajay
On Tue, Dec 23, 2014 at 3:47 PM, Gerard Maas gerard.m...@gmail.com wrote:
I'm not aware
Is anybody experiencing this? It looks like a bug in JetS3t to me, but
thought I'd sanity check before filing an issue.
I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with a S3 URL
(s3://fake-test/1234).
The code does write to S3, but with double forward slashes
I filed a new issue HADOOP-11444. According to HADOOP-10372, s3 is likely
to be deprecated anyway in favor of s3n.
Also the comment section notes that Amazon has implemented an EmrFileSystem
for S3 which is built using AWS SDK rather than JetS3t.
On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji
I am installing Spark 1.2.0 on CentOS 6.6. I just downloaded the code from GitHub,
installed Maven, and am trying to compile it:
git clone https://github.com/apache/spark.git
git checkout v1.2.0
mvn -DskipTests clean package
This leads to an OutOfMemoryException. How much memory does it require?
You might try a little more. The official guidance suggests 2GB:
https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
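For reference, the setting on that page looks roughly like this (exact values may
differ by version):

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -DskipTests clean package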
On Tue, Dec 23, 2014 at 2:57 PM, Vladimir Protsenko
protsenk...@gmail.com wrote:
I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded
This depends on which output format you want. For Parquet, you can
simply do this:
hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file")
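If plain text output is wanted instead (closer to INSERT OVERWRITE DIRECTORY), a
minimal sketch along these lines should also work in Spark 1.x; the table and path
come from the messages above, and the separator is arbitrary:

// hiveContext.sql returns a SchemaRDD of Rows in Spark 1.x
hiveContext.sql("SELECT * FROM some_db.some_table")
  .map(row => row.mkString("\t"))           // flatten each Row to a tab-separated line
  .saveAsTextFile("/user/linqili/tmp/src")  // output directory on HDFS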
On 12/23/14 5:22 PM, LinQili wrote:
Hi Leo:
Thanks for your reply.
I am talking about using hive from spark to export data from
Hi Vladimir,
From the link Sean posted, if you use Java 8 there is the following note:
Note: For Java 8 and above this step is not required.
So if you have no problems using Java 8, give it a shot.
Best Regards,
Guru Medasani
From: so...@cloudera.com
Date: Tue, 23 Dec 2014 15:04:42 +
The text there is actually unclear. In Java 8, you still need to set
the max heap size (-Xmx2g). The optional bit is the
-XX:MaxPermSize=512M actually. Java 8 no longer has a separate
permanent generation.
On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani gdm...@outlook.com wrote:
Hi Vladimir,
Thanks for the clarification Sean.
Best Regards,
Guru Medasani
From: so...@cloudera.com
Date: Tue, 23 Dec 2014 15:39:59 +
Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
To: gdm...@outlook.com
CC: protsenk...@gmail.com; user@spark.apache.org
The text there is
Doh...figured it out.
Hello folks,
I'm trying to deploy a Spark driver on Amazon EMR in yarn-cluster mode
expecting to be able to access the Spark UI from the spark-master-ip:4040
address (default port). The problem here is that the Spark UI port is
always defined randomly at runtime, although I also tried to specify
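For reference, the property that normally controls this is spark.ui.port; a minimal
sketch of pinning it at submit time (the rest of the command is elided):

spark-submit --master yarn-cluster \
  --conf spark.ui.port=4040 \
  ...  # remaining application arguments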
Update:
t1 is good. After collecting on t1, I find that all rows are ok (is_new = 0).
Just after sampling, there are some rows where is_new = 1, which should have
been filtered out by the WHERE clause.
hi dears!
Is there an efficient way to drop the first line of an RDD[String]?
Any suggestions?
Thanks
Hi,
maybe the drop function is helpful for you (even though this is probably
more than you need, it's still an interesting read):
http://erikerlandson.github.io/blog/2014/07/27/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
Joerg
On Tue, Dec 23, 2014 at 5:45 PM, Hao Ren
Hi there,
We are using mllib 1.1.1, and doing Logistic Regression with a dataset of
about 150M rows.
The training part usually goes pretty smoothly without any retries. But
during the prediction stage and the BinaryClassificationMetrics stage, I am
seeing retries with fetch failure errors.
The
There is also a lazy implementation:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/
I generated a PR for it -- there was also an alternate proposal for having it
be a library in the new Spark Packages site:
I've been attempting to run a job based on MLlib's ALS implementation for a
while now and have hit an issue I'm having a lot of difficulty getting to
the bottom of.
On a moderate size set of input data it works fine, but against larger
(still well short of what I'd think of as big) sets of data,
Hi Gerard,
SPARK-4286 is the ticket I am working on; besides supporting the shuffle
service, it also supports the executor scaling callbacks (kill/request
total) for coarse grain mode.
I created SPARK-4940 to discuss more about the distribution problem, and
let's bring our discussions there.
Hafiz,
You can probably use the RDD.mapPartitionsWithIndex method.
Mike
On Tue, Dec 23, 2014 at 8:35 AM, Hafiz Mujadid [via Apache Spark User List]
ml-node+s1001560n20834...@n3.nabble.com wrote:
hi dears!
Is there some efficient way to drop first line of an RDD[String]?
any suggestion?
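A minimal sketch of the mapPartitionsWithIndex approach suggested above, assuming
the first line sits in partition 0 (which holds for a freshly loaded text file):

import org.apache.spark.rdd.RDD

def dropFirstLine(rdd: RDD[String]): RDD[String] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter  // skip only the first element of the first partition
  }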
We have streaming linear regression (since v1.1) and k-means (v1.2) in
MLlib. You can check the user guide:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering
-Xiangrui
On Tue,
Sean's PR may be relevant to this issue
(https://github.com/apache/spark/pull/3702). As a workaround, you can
try to truncate the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643)
before sending them to BinaryClassificationMetrics. This may not work
well if the score distribution is very skewed. See
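A minimal sketch of that rounding workaround, assuming scoreAndLabels is an
RDD[(Double, Double)] of (raw score, label) pairs:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

val rounded = scoreAndLabels.map { case (score, label) =>
  (math.rint(score * 10000) / 10000.0, label)  // keep ~4 decimal digits to cap distinct scores
}
val metrics = new BinaryClassificationMetrics(rounded)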
Yep Michael Quinlan, it's working as suggested by Hao Ren.
Thanks to you and Hao Ren.
Hi all,
I'm trying to lock down ALL Spark ports and have tried both
spark-defaults.conf and the SparkContext. (The example below was run in
local[*] mode, but all attempts to run in local mode or via spark-submit.sh on the
cluster with a jar give the same results.)
My goal is to define all
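For reference, a sketch of the kind of spark-defaults.conf entries involved for
Spark 1.x (property names as listed in the 1.x configuration/security docs; the port
values are arbitrary, and whether this covers every port depends on the exact version):

spark.driver.port           7001
spark.fileserver.port       7002
spark.broadcast.port        7003
spark.replClassServer.port  7004
spark.blockManager.port     7005
spark.executor.port         7006
spark.ui.port               4040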
Hey Xiangrui,
Is there any plan to have a streaming compatible ALS version?
Or if it's currently doable, is there any example?
On Tue, Dec 23, 2014 at 4:31 PM, Xiangrui Meng men...@gmail.com wrote:
We have streaming linear regression (since v1.1) and k-means (v1.2) in
MLlib. You can
Xiangrui: Hi, I want to use this with streaming and with a job too. I use
Kafka (streaming) and Elasticsearch (job) as sources and want to calculate a
sentiment value from the input text.
Simon: great, do you have any doc on how I can embed it into my application without
using the HTTP interface? (how can I
Yes, my change is slightly downstream of this point in the processing
though. The code is still creating a counter for each distinct score
value, and then binning. I don't think that would cause a failure -
just might be slow. At the extremes, you might see 'fetch failure' as
a symptom of things
Silly question: what is the best way to shuffle protobuf messages in a Spark
(Streaming) job? Can I use Kryo on top of the protobuf Message type?
--
Chen Song
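Not an authoritative answer, but a minimal sketch of what switching the shuffle to
Kryo looks like; MyProtoMessage is a hypothetical protobuf-generated class, and real
protobuf types usually also need a dedicated Kryo serializer (e.g. from the
kryo-serializers project), which is not shown here:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("proto-shuffle")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// From Spark 1.2 on, classes can be registered so Kryo avoids writing full class names:
// conf.registerKryoClasses(Array(classOf[MyProtoMessage]))  // MyProtoMessage is hypothetical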
I've had a lot of difficulties with using the s3:// prefix. s3n:// seems
to work much better. Can't find the link ATM, but I seem to recall that
s3:// (Hadoop's original block format for S3) is no longer recommended for
use. Amazon's EMR goes so far as to remap the s3:// to s3n:// behind the
Have a look at RDD.groupBy(...) and reduceByKey(...)
On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh sachin.sha...@gmail.com
wrote:
Hi,
I have a csv file having fields as a,b,c .
I want to do aggregation(sum,average..) based on any field(a,b or c) as per
user input,
using Apache Spark Java
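A minimal Scala sketch of the reduceByKey approach suggested above (the question
asked for the Java API, but the idea carries over; the file path and field indices
are placeholders, with field 0 as the grouping key and field 2 as the value):

import org.apache.spark.SparkContext._  // pair-RDD implicits, needed before Spark 1.3

// lines look like "a,b,c"; compute sum and average of field 2 grouped by field 0
val lines = sc.textFile("/path/to/file.csv")
val sumCount = lines
  .map(_.split(","))
  .map(f => (f(0), (f(2).toDouble, 1L)))  // (key, (value, count))
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
val sumByKey = sumCount.mapValues(_._1)
val avgByKey = sumCount.mapValues { case (sum, count) => sum / count }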
Hi spark users,
I'm trying to create an external table using HiveContext after creating a
schemaRDD and saving the RDD into a parquet file on hdfs.
I would like to use the schema in the schemaRDD (rdd_table) when I create
the external table.
For example:
http://www.jets3t.org/toolkit/configuration.html
Put the following properties in a file named jets3t.properties and make
sure it is available during the running of your Spark job (just place it in
~/ and pass a reference to it when calling spark-submit with --file
~/jets3t.properties)
Turns out I was using the s3:// prefix (in a standalone Spark cluster). It
was writing a LOT of block_* files to my S3 bucket, which was the cause for
the slowness. I was coming from Amazon EMR, where Amazon's underlying FS
implementation has re-mapped s3:// to s3n://, which doesn't use the
Hi there
We are on mllib 1.1.1, and trying different regularization parameters. We
noticed that the regParam doesn't affect the weights at all. Is setting the
reg param via the optimizer the right thing to do? Do we need to set our
own updater? Anyone else seeing the same behaviour?
thanks again
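For what it's worth, a minimal sketch of what setting regularization through the
optimizer looks like in mllib 1.1; if memory serves, the default SimpleUpdater
applies no regularization at all, which would explain weights that ignore regParam
(trainingData is assumed to be an RDD[LabeledPoint] defined elsewhere):

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.SquaredL2Updater

val lr = new LogisticRegressionWithSGD()
lr.optimizer
  .setNumIterations(100)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)  // without an L1/L2 updater, regParam has no effect
val model = lr.run(trainingData)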
Hi All,
We are getting an exception after we added one RDD to another RDD.
We first declared an empty RDD A, then received a new DStream B from
Kafka; for each RDD in the DStream B, we kept adding them to the existing
RDD A. The error happened when we were trying to use the updated RDD A.
Could anybody
Hi All,
It seems the problem is a little more complicated.
The job is hung up reading an S3 file; even if I kill the unix process that
started the job, it does not kill the Spark job. It is still hung up there.
Now the questions are:
How do I find the Spark job based on its name?
How do I kill the
Here is a more cleaned up version, which can be used in ./sbt/sbt
hive/console to easily reproduce this issue:
sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement = false, fraction = 0.05).
  registerTempTable("sampled")
println(table("sampled").queryExecution)
val query = sql("SELECT
Hi Phil,
This sounds a lot like a deadlock in Hadoop's Configuration object that I
ran into a while back. If you jstack the JVM and see a thread that looks
like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546
Executor task launch worker-6 daemon prio=10
You should be able to kill the job using the webUI or via spark-class.
More info can be found in the thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html.
HTH!
On Tue, Dec 23, 2014 at 4:47 PM, durga durgak...@gmail.com wrote:
Could you please provide a complete stacktrace? Also it would be good if
you could share your hive-site.xml as well.
On 12/23/14 4:42 PM, Dai, Kevin wrote:
Hi, there
When I use the hive UDF from_unixtime with the HiveContext, the job blocks
and the log is as follows:
sun.misc.Unsafe.park(Native
I am using Eclipse to develop a Spark application (using Spark 1.1.0). I use
the ScalaTest framework to test the application. But I was blocked by
the following exception:
java.lang.SecurityException: class javax.servlet.FilterRegistration's
signer
information does not match signer information
Hi, Lam, I can confirm this is a bug with the latest master, and I filed a jira
issue for this:
https://issues.apache.org/jira/browse/SPARK-4944
Hope to come up with a solution soon.
Cheng Hao
From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, December 24, 2014 4:26 AM
To:
It's working now. Probably I didn't specify the excluded list correctly. I
kept revising it and now it's working. Thanks.
Ey-Chih Chow
Thanks All.
Finally, the working code is below:
object PlayRecord {
def getUserActions(accounts: RDD[String], idType: Int, timeStart: Long,
timeStop: Long, cacheSize: Int,
filterSongDays: Int, filterPlaylistDays: Int):
RDD[(String, (Int, Set[Long], Set[Long]))] = {
Hi,
On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab as...@live.com wrote:
def doSomething(entry: SomeEntry, session: Session): SomeOutput = {
val result = session.SomeOp(entry)
SomeOutput(entry.Key, result.SomeProp)
}
I could use a transformation for rdd.map, but in case of failures,
Hi,
The official pom.xml file only has a profile for Hadoop version 2.4 as the
latest version, but I installed Hadoop version 2.6.0 with Ambari. How can I
build Spark against it? Just using mvn -Dhadoop.version=2.6.0, or how do I make a
corresponding profile for it?
Regards,
Xiaobo
See http://search-hadoop.com/m/JW1q5Cew0j
On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 guxiaobo1...@qq.com wrote:
Hi,
The official pom.xml file only have profile for hadoop version 2.4 as the
latest version, but I installed hadoop version 2.6.0 with ambari, how can I
build spark against it,
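For reference, the approach usually suggested on the building-spark page is to pick
the closest Hadoop profile and override the version; a sketch, not verified against
2.6.0 specifically:

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -DskipTests clean package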
Hi Josh,
If you want to cap the amount of memory per executor in Coarse grain mode,
then yes you only get 240GB of memory as you mentioned. What's the reason
you don't want to raise the capacity of memory you use per executor?
In coarse grain mode the Spark executor is long living and it
I am trying to use SQL over Spark Streaming using Java, but I am getting a
serialization exception.
public static void main(String args[]) {
SparkConf sparkConf = new SparkConf().setAppName("NumberCount");
JavaSparkContext jc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc =
Hi dears!
I want to convert a SchemaRDD into an RDD of String. How can we do that?
Currently I am doing it like this, which is not converting correctly: no
exception, but the resultant strings are empty.
Here is my code:
def SchemaRDDToRDD(schemaRDD: SchemaRDD): RDD[String] = {
var
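For comparison, a minimal sketch of how this conversion is often done in Spark 1.x
(each Row behaves as a Seq, so mkString flattens it; the separator is arbitrary):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SchemaRDD

def schemaRDDToRDD(schemaRDD: SchemaRDD): RDD[String] =
  schemaRDD.map(row => row.mkString(","))  // Row extends Seq[Any] in Spark 1.x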
I've eliminated the fetch failures with these parameters (I don't know which
was the right one for the problem),
passed to spark-submit running 1.2.0:
--conf spark.shuffle.compress=false \
--conf spark.file.transferTo=false \
--conf spark.shuffle.manager=hash \
--conf
FYI,
Latest hive 0.14/parquet will have column renaming support.
Jianshi
On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust mich...@databricks.com
wrote:
You might also try out the recently added support for views.
On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
I have an existing batch processing system which uses SQL queries to extract
information from data. I want to replace this with a real-time system.
I am coding in Java, and for using SQL on streaming data I found a few examples,
but none of them is complete.
Here are the Java API docs:
https://spark.apache.org/docs/latest/api/java/index.html
You can start with this example and convert it into Java (it's pretty
straightforward):
http://stackoverflow.com/questions/25484879/sql-over-spark-streaming
E.g., in Scala:
val sparkConf = new
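For completeness, a minimal Scala sketch of the pattern from that Stack Overflow
answer: register each micro-batch RDD as a temp table inside foreachRDD and query it
with SQL. The case class, socket source, and table name are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

case class Record(word: String)  // placeholder schema

val sparkConf = new SparkConf().setAppName("SqlOverStreaming")
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

lines.foreachRDD { rdd =>
  val sqlContext = new SQLContext(rdd.sparkContext)  // real code would reuse a singleton
  import sqlContext.createSchemaRDD  // implicit RDD[case class] -> SchemaRDD in Spark 1.1/1.2
  val records = rdd.map(Record(_))
  records.registerTempTable("records")
  sqlContext.sql("SELECT word, COUNT(*) FROM records GROUP BY word").collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()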