Hi there,
We have released our real-time aggregation engine based on Spark Streaming.
SPARKTA is fully open source (Apache2)
You can check out the slides we showed at Strata last week:
http://www.slideshare.net/Stratio/strata-sparkta
Source code:
https://github.com/Stratio/sparkta
And
Thanks for the reply.
I have not tried it out (I will today and report on my results) but I think
what I need to do is to call mapPartitions and pass it a function that sets the
seed. I was planning to pass the seed value in the closure.
Something like:
my_seed = 42
def f(iterator):
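A minimal sketch of that idea (this completion is my assumption, the original message is cut off here), using Python's built-in random module with the seed captured in the closure:

import random

my_seed = 42

def f(iterator):
    # seed once per partition, then draw a value per record
    random.seed(my_seed)
    for x in iterator:
        yield (x, random.random())

# result = rdd.mapPartitions(f)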
Here is an example of how I would pass in the S3 parameters to hadoop
configuration in pyspark.
You can do something similar for other parameters you want to pass to the
hadoop configuration
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl",
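A fuller sketch of that pattern (the property names below are the standard Hadoop s3/s3n keys; the credential values are placeholders):

hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY_ID")          # placeholder
hadoopConf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY")  # placeholder
# after this, reads such as sc.textFile("s3n://bucket/path") pick up the settings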
What have you tried so far?
Maybe the easiest way is to use a collection and reduce the values by adding
them element-wise.
JavaPairRDD<String, String> pairRDD = sc.parallelizePairs(data);
JavaPairRDD<String, List<Integer>> result = pairRDD.mapToPair(new
Functions.createList())
    .mapToPair(new
Delete from table is available as part of Hive 0.14 (reference: Apache Hive
Language Manual DML - Delete
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Delete)
while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive
0.14 or generate a new
Yes, it happens repeatedly on my local Jenkins.
Sent from my iPhone
On May 14, 2015, at 18:30, Tathagata Das
t...@databricks.com wrote:
Do you get this failure repeatedly?
On Thu, May 14, 2015 at 12:55 AM, kf
wangf...@huawei.com wrote:
Hi, all, I got the following
It still doesn't work…
The StreamingContext does detect the new file, but it shows:
ERROR dstream.FileInputDStream: File
hdfs://nameservice1/sandbox/hdfs/list_join_action/2015_05_14_20_stream_1431605640.lz4
has no data in it. Spark Streaming can only ingest files that have been
moved to the
Hi,
I have a JavaPairRDD<String, String> and I want to implement the reduceByKey
method.
My pairRDD :
*2553: 0,0,0,1,0,0,0,0*
46551: 0,1,0,0,0,0,0,0
266: 0,1,0,0,0,0,0,0
*2553: 0,0,0,0,0,1,0,0*
*225546: 0,0,0,0,0,1,0,0*
*225546: 0,0,0,0,0,1,0,0*
I want to get :
*2553: 0,0,0,1,0,1,0,0*
46551:
thanks, Wilfred.
In our program, the htrace-core-3.1.0-incubating.jar dependency is only
required by the executor, not by the driver,
while in both yarn-client and yarn-cluster mode the executors run in the
cluster.
And clearly, in yarn-cluster mode, the jar IS in
spark.yarn.secondary.jars,
but
I see that the pre-built distributions include hive-shims-0.23 shaded in the
spark-assembly jar (unlike when I make the distribution myself).
Does anyone know what I should do to include the shims in my distribution?
On Thu, May 14, 2015 at 9:52 AM, Lior Chaga lio...@taboola.com wrote:
...This is madness!
On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
Hi there,
We have released our real-time aggregation engine based on Spark Streaming.
SPARKTA is fully open source (Apache2)
You can check out the slides we showed at Strata last week:
Yea, I wouldn't try and modify the current one, since RDDs are supposed to be
immutable; just create a new one...
val newRdd = oldRdd.map(r => (r._2, r._1))
or something of that nature...
Steve
From: Evo Eftimov [evo.efti...@isecc.com]
Sent: Thursday, May 14, 2015
The list of unsupported hive features should mention that it implicitly
includes features added after Hive 13. You cannot yet compile with Hive
14, though we are investigating this for 1.5
On Thu, May 14, 2015 at 6:40 AM, Denny Lee denny.g@gmail.com wrote:
Delete from table is available
Where is the “Tuple” supposed to be in <String, String> - you can refer to a
“Tuple” if it was e.g. <String, Tuple2<String, String>>
From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of
Holden Karau
Sent: Thursday, May 14, 2015 5:56 PM
To: Yasemin Kaya
Cc:
You can configure Spark SQL's Hive interaction by placing a hive-site.xml
file in the conf/ directory.
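For example, a minimal hive-site.xml along these lines (the warehouse location below is only a placeholder, not a recommendation):

<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <!-- placeholder external location -->
    <value>s3n://my-bucket/hive/warehouse</value>
  </property>
</configuration>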
On Thu, May 14, 2015 at 10:24 AM, jamborta jambo...@gmail.com wrote:
Hi all,
is it possible to set hive.metastore.warehouse.dir, which is internally
created by Spark, to be stored externally
I solved my problem this way.
JavaPairRDD<String, String> swappedPair = pair.mapToPair(
    new PairFunction<Tuple2<String, String>, String, String>() {
        @Override
        public Tuple2<String, String> call(
                Tuple2<String, String> item)
                throws Exception {
            return item.swap();
        }
    });
2015-05-14 20:42 GMT+03:00
Can you paste your code? transformations return a new RDD rather than
modifying an existing one, so if you were to swap the values of the tuple
using a map you would get back a new RDD and then you would want to try and
print this new RDD instead of the original one.
On Thursday, May 14, 2015,
Hi all,
is it possible to set hive.metastore.warehouse.dir, which is internally
created by Spark, to be stored externally (e.g. s3 on aws or wasb on azure)?
thanks,
End of the month is the target:
https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
On Thu, May 14, 2015 at 3:45 AM, Ishwardeep Singh
ishwardeep.si...@impetus.co.in wrote:
Hi Michael, Ayan,
Thank you for your response to my problem.
Michael, do we have a tentative release
Here is some Python code; I am sure you'll get the drift. Basically you need to
implement 2 functions, seq and comb, for the partial and final
operations.
def addtup(t1, t2):
    j = ()
    for k, v in enumerate(t1):
        j = j + (t1[k] + t2[k],)
    return j

def seq(tIntrm, tNext):
    return
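The snippet is cut off above; a compact sketch of the same element-wise idea (the bodies of seq and comb are my assumption about where the example was going):

def addtup(t1, t2):
    # element-wise sum of two equal-length tuples
    return tuple(a + b for a, b in zip(t1, t2))

def seq(acc, nxt):
    # fold the next value (a tuple of counts) into the running partial result
    return addtup(acc, nxt)

def comb(a, b):
    # merge two partial results coming from different partitions
    return addtup(a, b)

# e.g. pair_rdd.aggregateByKey((0,) * 8, seq, comb)
# or simply pair_rdd.reduceByKey(addtup) when the values are already tuples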
Super, it worked. Thanks
On Fri, May 15, 2015 at 12:26 AM, Ram Sriharsha sriharsha@gmail.com
wrote:
Here is an example of how I would pass in the S3 parameters to hadoop
configuration in pyspark.
You can do something similar for other parameters you want to pass to the
hadoop
Nice Job!
we are developing something very similar... I will contact you to understand if
we can contribute some pieces to you!
Best
Paolo
From: Evo Eftimov evo.efti...@isecc.com
Sent: Thursday, May 14, 2015, 17:21
To: 'David Morales' dmora...@stratio.com, Matei
I do not intend to provide comments on the actual “product” since my time is
engaged elsewhere.
My comments were on the “process” for commenting, which looked like
self-indulgent, patting-on-the-back communication (between members of the
party and its party leader) – that bs used to be
We put a lot of work into Sparkta and it is awesome to hear from both the
community and relevant people. It's as simple as that.
I hope you have time to consider the project, which is our main concern at
this moment, and we hope to hear from you too.
2015-05-14 17:46 GMT+02:00 Evo Eftimov
(Sorry, for non-English people: that means it's a good thing.)
Matei
On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com wrote:
...This is madness!
On May 14, 2015, at 9:31 AM, dmoralesdf dmora...@stratio.com wrote:
Hi there,
We have released our real-time
Thank you Paolo. Don't hesitate to contact us.
Evo, we will be glad to hear from you and we are happy to see some kind of
fast feedback from the main thought leader of spark, for sure.
2015-05-14 17:24 GMT+02:00 Paolo Platter paolo.plat...@agilelab.it:
Nice Job!
we are developing
Thanks for your kind words, Matei; happy to see that our work is on the
right track.
2015-05-14 17:10 GMT+02:00 Matei Zaharia matei.zaha...@gmail.com:
(Sorry, for non-English people: that means it's a good thing.)
Matei
On May 14, 2015, at 10:53 AM, Matei Zaharia matei.zaha...@gmail.com
Hi,
Is it possible to set up streams from multiple Kinesis streams and process
them in a single job? From what I have read, this should be possible,
however, the Kinesis layer errors out whenever I try to receive from more
than a single Kinesis Stream.
Here is the code. Currently, I am focused
Hi all,
I'm a complete newbie to spark and spark streaming, so the question may seem
obvious, sorry for that.
Is it okay to store Seq[Data] in state when using 'updateStateByKey'? I have
a function with signature
def saveState(values: Seq[Msg], value: Option[Iterable[Msg]]):
Option[Iterable[Msg]]
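In case it helps, a minimal sketch of that pattern in the Python API (the Scala signature above is the asker's; this is only an illustration of keeping a collection in the state):

def save_state(new_msgs, current):
    # current is None the first time a key is seen; keep a growing list of messages
    return (current or []) + list(new_msgs)

# msg_stream is a DStream of (key, msg) pairs
# state_stream = msg_stream.updateStateByKey(save_state)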
That has been a really rapid “evaluation” of the “work” and its “direction”
From: David Morales [mailto:dmora...@stratio.com]
Sent: Thursday, May 14, 2015 4:12 PM
To: Matei Zaharia
Cc: user@spark.apache.org
Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming
Is it possible to configure each machine that Spark is using as a worker
individually? For instance, setting the maximum number of cores to use for each
machine individually, or the maximum memory, or other settings related to
workers? Or is there any other way to specify a per-machine capacity
Hi Team,
I have a Hive partitioned table with a partition column that contains spaces.
When I try to run any query, say a simple Select * from table_name, it
fails.
*Please note the same was working in spark 1.2.0, now I have upgraded to
1.3.1. Also there is no change in my application code base.*
If I
Hi,
I want to run a definite number of iterations in Kmeans. There is a command
line argument to set maxIterations, but even if I set it to a number, Kmeans
runs until the centroids converge.
Is there a specific way to specify it in command line?
Also, I wanted to know if we can supply
Hello,
May I know if there is a way to implement an aggregate function for grouped data
in a DataFrame? I dug into the doc but didn't find any, apart from the UDF
functions which apply to a Row. Maybe I have missed something. Thanks.
Justin
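For the built-in aggregations (as opposed to a full user-defined aggregate, which was not yet supported on grouped data at the time), something along these lines works; the column names here are made up:

from pyspark.sql import functions as F

# df is an existing DataFrame with hypothetical columns "category" and "amount"
result = df.groupBy("category").agg(
    F.sum("amount").alias("total"),
    F.avg("amount").alias("mean"))
result.show()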
@TD How do I file a JIRA?
On Tue, May 12, 2015 at 2:06 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
I wonder if that may be a bug in the Python API. Please file it as a JIRA
along with sample code to reproduce it and sample output you get.
On Tue, May 12, 2015 at 10:00 AM, Vadim
How does textFileStream work behind the scenes? How does Spark Streaming
know what files are new and need to be processed? Is it based on time
stamp, file name?
Thanks,
Vadim
I am trying to extract the *output data size* information for *each task*.
What *field(s)* should I look for, given the json-format log?
Also, what does Result Size stand for?
Thanks a lot in advance!
-Yanwei
I have tried to put the hive-site.xml file in the conf/ directory, but it
seems it is not picked up from there.
On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust mich...@databricks.com
wrote:
You can configure Spark SQL's Hive interaction by placing a hive-site.xml
file in the conf/ directory.
What is the error you are seeing?
TD
On Thu, May 14, 2015 at 9:00 AM, Erich Ess er...@simplerelevance.com
wrote:
Hi,
Is it possible to set up streams from multiple Kinesis streams and process
them in a single job? From what I have read, this should be possible,
however, the Kinesis layer
Hi,
I have a text dataset to which I want to apply a clustering algorithm from MLlib.
This data is an (n, m) matrix with readers.
I would like to know if the team can help me with how to load this data in Spark
Scala and separate the variable I want to cluster.
Thanks
Rick.
Hi,
I am trying to validate our modeling data pipeline by running
LogisticRegressionWithLBFGS on a dataset with ~3.7 million features,
basically to compute AUC. This is on Spark 1.3.0.
I am using 128 executors with 4 GB each + driver with 8 GB. The number of
data partitions is 3072
The
have you tried to union the 2 streams per the KinesisWordCountASL example
https://github.com/apache/spark/blob/branch-1.3/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L120
where
2 streams (against the same Kinesis stream in this case) are created
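The union pattern itself looks roughly like this (a sketch only; create_stream is a hypothetical helper standing in for the KinesisUtils.createStream calls in the linked example):

# ssc is the StreamingContext; create_stream is a hypothetical helper that
# builds one DStream per Kinesis stream/shard
num_streams = 2
streams = [create_stream(i) for i in range(num_streams)]
unioned = ssc.union(*streams)
unioned.count().pprint()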
A possible problem may be that the Kinesis stream in 1.3 uses the
SparkContext app name as the Kinesis application name, which is used by the
Kinesis Client Library to save checkpoints in DynamoDB. Since both Kinesis
DStreams are using the same Kinesis application name (as they are in the same
*Join the Apache Spark community at the fourth Spark Summit in San
Francisco on June 15, 2015. At Spark Summit 2015 you will hear keynotes
from NASA, the CIA, Toyota, Databricks, AWS, Intel, MapR, IBM, Cloudera,
Hortonworks, Timeful, O'Reilly, and Andreessen Horowitz. 260 talk proposals
were
In my application, I want to start a DStream computation only after a
special event has happened (for example, I want to start the receiver
only after the reference data has been properly initialized).
My question is: it looks like the DStream will be started right after
the StreamingContext has
Thank you, should I open a JIRA for this issue?
From: Olivier Girardot [mailto:ssab...@gmail.com]
Sent: Tuesday, May 12, 2015 5:12 AM
To: Reynold Xin
Cc: Haopu Wang; user
Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
I'll look into it
Hi TD, regarding the performance of updateStateByKey, do you have a
JIRA for that so we can watch it? Thank you!
From: Tathagata Das [mailto:t...@databricks.com]
Sent: Wednesday, April 15, 2015 8:09 AM
To: Krzysztof Zarzycki
Cc: user
Subject: Re: Is it
Greetings,
I have a relatively complex application with Spark, Jetty and Guava (16) not
fitting together.
An exception happens when some components try to use a mix of Guava classes
(including Spark's pieces) that are loaded by different classloaders:
java.lang.LinkageError: loader constraint
What version of Spark are you using?
The bug you mention is only about the Optional class (and a handful of
others, but none of the classes you're having problems with). All other
Guava classes should be shaded since Spark 1.2, so you should be able to
use your own version of Guava with no
With this information it is hard to predict. What's the performance you are
getting? What's your desired performance? Maybe you can post your code and
experts can suggest improvements?
On 14 May 2015 15:02, sachin Singh sachin.sha...@gmail.com wrote:
Hi Friends,
please, can someone give the
Hi Tathagata,
I think that's exactly what's happening.
The error message is:
com.amazonaws.services.kinesis.model.InvalidArgumentException:
StartingSequenceNumber
49550673839151225431779125105915140284622031848663416866 used in
GetShardIterator on shard shardId-0002 in stream erich-test
The problem is with 1.3.1.
It has the Function class (mentioned in the exception) in
spark-network-common_2.10-1.3.1.jar.
Our current resolution is actually to go back to 1.2.2, which is working fine.
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Thursday, May 14, 2015 6:27 PM
To: Anton Brazhnyk
Jo
Thanks for the reply, but _jsc does not have anything for passing hadoop
configs. Can you illustrate your answer a bit more? TIA...
On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha sriharsha@gmail.com
wrote:
yes, the SparkContext in the Python API has a reference to the
JavaSparkContext
Hello Bright Sparks,
I was using Spark 1.3.0 to push data out to Parquet files. They have been
working great, super fast, easy way to persist data frames etc.
However I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1.
I unzipped it, copied my config over and then went to read
After profiling with YourKit, I see there's an OutOfMemoryException in
context SQLContext.applySchema. Again, it's a very small RDD. Each executor
has 180GB RAM.
On Thu, May 14, 2015 at 8:53 AM, Lior Chaga lio...@taboola.com wrote:
Hi,
Using spark sql with HiveContext. Spark version is 1.3.1
What do you mean by not detected? Maybe you forgot to trigger some action
on the stream to get it executed. Like:
val list_join_action_stream = ssc.fileStream[LongWritable, Text,
TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString)
*list_join_action_stream.count().print()*
The data pipeline (DAG) should not be added to the StreamingContext in the
case of a recovery scenario. The pipeline metadata is recovered from the
checkpoint folder. That is one thing you will need to fix in your code.
Also, I don't think the ssc.checkpoint(folder) call should be made in case
of
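In other words, the whole pipeline and the checkpoint() call belong inside the context-creation function. A rough sketch of the pattern (Python API shown; the paths and input source are placeholders, sc is an existing SparkContext, and getOrCreate may require a newer release than the one discussed here):

from pyspark.streaming import StreamingContext

checkpoint_dir = "hdfs:///path/to/checkpoints"   # placeholder

def create_context():
    # build the StreamingContext and the full DAG here, and only here
    ssc = StreamingContext(sc, 10)
    ssc.checkpoint(checkpoint_dir)
    lines = ssc.textFileStream("hdfs:///path/to/input")   # placeholder source
    lines.count().pprint()
    return ssc

# on recovery the pipeline is rebuilt from the checkpoint, not from create_context
ssc = StreamingContext.getOrCreate(checkpoint_dir, create_context)
ssc.start()
ssc.awaitTermination()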
Ultimately it was PermGen out of memory. I somehow missed it in the log
On Thu, May 14, 2015 at 9:24 AM, Lior Chaga lio...@taboola.com wrote:
After profiling with YourKit, I see there's an OutOfMemoryException in
context SQLContext.applySchema. Again, it's a very small RDD. Each executor
has
Did you happen to have a look at the Spark Job Server?
https://github.com/ooyala/spark-jobserver Someone wrote a Python wrapper
https://github.com/wangqiang8511/spark_job_manager around it, give it a
try.
Thanks
Best Regards
On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW
Depends on which compile time you are talking about.
*scala compile time*: No, the information about which columns are available
is usually coming from a file or an external database which may or may not
be available to scalac.
*query compile time*: While your program is running, but before any
Can you share the client code that you used to send the data? May be this
discussion would give you some insights
http://apache-avro.679487.n3.nabble.com/Avro-RPC-Python-to-Java-isn-t-working-for-me-td4027454.html
Thanks
Best Regards
On Thu, May 14, 2015 at 8:44 AM, 鹰 980548...@qq.com wrote:
How do I unsubscribe from this mailing list please?
Thanks!!
Regards,
Saurabh Agrawal
Vice President
Markit
Green Boulevard
B-9A, Tower C
3rd Floor, Sector - 62,
Noida 201301, India
+91 120 611 8274 Office
This e-mail, including accompanying communications
Hi,
I have a *JavaPairRDD<String, String>* and I want to *swap tuple._1() with
tuple._2()*. I use *tuple.swap()*, but the JavaPairRDD is not actually
changed. When I print the JavaPairRDD, the values are the same.
Can anyone help me with that?
Thank you.
Have a nice day.
yasemin
--
hiç ender hiç
This change will be merged shortly for Spark 1.4, and has a minor
implication for those creating their own Spark builds:
https://issues.apache.org/jira/browse/SPARK-7249
https://github.com/apache/spark/pull/5786
The default Hadoop dependency has actually been Hadoop 2.2 for some
time, but the
It would be good if you could tell me what I should add to the documentation to
make it easier to understand. I can update the docs for the 1.4.0 release.
On Tue, May 12, 2015 at 9:52 AM, Lee McFadden splee...@gmail.com wrote:
Thanks for explaining Sean and Cody, this makes sense now. I'd like to
help
That's because you are using TextInputFormat, I think; try
with LzoTextInputFormat, like:
val list_join_action_stream = ssc.fileStream[LongWritable, Text,
com.hadoop.mapreduce.LzoTextInputFormat](gc.input_dir, (t: Path) => true,
false).map(_._2.toString)
Thanks
Best Regards
On Thu, May 14, 2015 at
Have a look https://spark.apache.org/community.html
Send an email to user-unsubscr...@spark.apache.org
Thanks
Best Regards
On Thu, May 14, 2015 at 1:08 PM, Saurabh Agrawal saurabh.agra...@markit.com
wrote:
How do I unsubscribe from this mailing list please?
Thanks!!
Regards,
Please see
http://spark.apache.org/community.html
Cheers
On May 14, 2015, at 12:38 AM, Saurabh Agrawal saurabh.agra...@markit.com
wrote:
How do I unsubscribe from this mailing list please?
Thanks!!
Regards,
Saurabh Agrawal
Vice President
Markit
Green Boulevard
B-9A,
Hi, all, I got the following error when I run the unit tests of Spark via dev/run-tests
on the latest branch-1.4 branch.
the latest commit id:
commit d518c0369fa412567855980c3f0f426cde5c190d
Author: zsxwing zsxw...@gmail.com
Date: Wed May 13 17:58:29 2015 -0700
error
[info] Test
Here's
https://github.com/twitter/hadoop-lzo/blob/master/src/main/java/com/hadoop/mapreduce/LzoTextInputFormat.java
the class. You can read more here
https://github.com/twitter/hadoop-lzo#maven-repository
Thanks
Best Regards
On Thu, May 14, 2015 at 1:22 PM, lisendong lisend...@163.com wrote:
Hi Keegan,
Thanks a lot. Now I know the column represents all the words, without
repetition, in all the documents. I don't know what determines the order of the
words; does it make any difference if the column's words are in a different
order? Thanks.
Thanks everyone, that was the problem. The create-new-streaming-context
function was supposed to set up the stream processing as well
as the checkpoint directory. I had missed the whole process of
checkpoint setup. With that done, everything works as
Hi Michael, Ayan,
Thank you for your response to my problem.
Michael, do we have a tentative release date for Spark version 1.4?
Regards,
Ishwardeep
From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, May 13, 2015 10:54 PM
To: ayan guha
Cc: Ishwardeep Singh; user
Subject:
Sorry for the late reply.
Here is what I was thinking:
import random as r

def main():
    # get SparkContext
    # Just for fun, let's assume the seed is an id
    filename = "bin.dat"
    seed = id(filename)
    # broadcast it
    br = sc.broadcast(seed)
    # set up a dummy list
    lst = []
    for i in range(4):
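The message is cut off here; a self-contained sketch of where it appears to be going (the RDD setup and the mapPartitions step are my assumptions, not the original author's code):

from pyspark import SparkContext
import random as r

sc = SparkContext(appName="seed-example")      # hypothetical app name
br = sc.broadcast(id("bin.dat"))               # broadcast the seed
rdd = sc.parallelize(range(100), 4)            # dummy data

def sample(iterator):
    r.seed(br.value)                           # seed once per partition
    return [r.random() for _ in iterator]

print(rdd.mapPartitions(sample).collect())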
Hi guys, I tried to delete some data from a table with delete from table
where name = xxx; however, delete is not functioning like the DML operation
in Hive. I got info like the following: Usage: delete [FILE|JAR|ARCHIVE] <value>
[<value>]*
15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor:
Do you get this failure repeatedly?
On Thu, May 14, 2015 at 12:55 AM, kf wangf...@huawei.com wrote:
Hi, all, I got the following error when I run the unit tests of Spark via
dev/run-tests
dev/run-tests
on the latest branch-1.4 branch.
the latest commit id:
commit d518c0369fa412567855980c3f0f426cde5c190d
Hi all,
I'm experimenting with Spark's Word2Vec implementation for a relatively
large (5B words, vocabulary size 4M, 400-dimensional vectors) corpus. Has
anybody had success running it at this scale?
Thanks in advance for your guidance!
-Shilad
Hi all,
We are planning to use SparkSQL in a DW system. There’s a question about the
caching mechanism of SparkSQL.
For example, if I have a SQL statement like sqlContext.sql(“select c1, sum(c2) from T1,
T2 where T1.key=T2.key group by c1”).cache()
Is it going to cache the final result or the raw data
file timestamp
-- Original Message --
From: Vadim Bichutskiy vadim.bichuts...@gmail.com
Sent: Friday, May 15, 2015, 4:55 AM
To: user@spark.apache.org
Subject: textFileStream Question
How does textFileStream work behind the scenes? How does Spark Streaming
Hi all,
We are planning to use SparkSQL in a DW system. There’s a question about the
caching mechanism of SparkSQL.
For example, if I have a SQL statement like sqlContext.sql(“select c1, sum(c2) from T1,
T2 where T1.key=T2.key group by c1”).cache()
Is it going to cache the final result or the raw data
another option (not really recommended, but worth mentioning) would be to
change the region of dynamodb to be separate from the other stream - and even
separate from the stream itself.
this isn't available right now, but will be in Spark 1.4.
On May 14, 2015, at 6:47 PM, Erich Ess