Hi,
I am interested in building an application that uses sliding windows not
based on the time when the item was received, but on either
* a timestamp embedded in the data, or
* a count (like: every 10 items, look at the last 100 items).
Also, I want to do this on stream data received from
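A count-based window (slide every 10 items over the last 100) is not one of Spark Streaming's built-in time-based window operators, but the bookkeeping itself is small. A plain-Python sketch with made-up names (`CountWindow`, `on_item`) and scaled-down sizes, just to show the logic:

```python
from collections import deque

# Count-based sliding window: every `slide` items, emit the last `size` items.
# The names here are illustrative, not a Spark API.
class CountWindow:
    def __init__(self, size, slide):
        self.buf = deque(maxlen=size)   # keeps only the last `size` items
        self.slide = slide
        self.since_emit = 0

    def on_item(self, item):
        """Return the current window at each slide boundary, else None."""
        self.buf.append(item)
        self.since_emit += 1
        if self.since_emit == self.slide:
            self.since_emit = 0
            return list(self.buf)
        return None

# Scaled-down demo: window of 5, sliding every 2 items.
w = CountWindow(size=5, slide=2)
emitted = [win for win in (w.on_item(i) for i in range(1, 7)) if win is not None]
```

In an actual streaming job this state would presumably have to live in a stateful operator (e.g. updateStateByKey), since the built-in window() slides by time, not by count.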
Hello,
I've successfully built a very simple Spark Streaming application in Java
that is based on the HdfsWordCount example in Scala at
https://github.com/apache/spark/blob/branch-1.1/examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
.
When I submit this application to
Hi,
I know there are not so many conferences on Spark in Paris, so I just
wanted to let you know that Ippon will be holding one on Thursday next
week (11th of December):
http://blog.ippon.fr/2014/12/03/ippevent-spark-ou-comment-traiter-des-donnees-a-la-vitesse-de-leclair/
There will be 3
Hi,
There have been some efforts to provide column-level
encryption/decryption on Hive tables:
https://issues.apache.org/jira/browse/HIVE-7934
Is there any plan to extend this functionality to Spark SQL as well?
Thanks,
Chirag
Yeah, the dot notation works. It works even for arrays. But I am not sure
if it can handle complex hierarchies.
On Mon Dec 08 2014 at 11:55:19 AM Cheng Lian lian.cs@gmail.com wrote:
You may access it via something like SELECT filterIp.element FROM tb,
just like Hive. Or if you’re using
I am trying to create (yet another) Spark-as-a-service tool that lets you
submit jobs via REST APIs. I think I have nearly gotten it to work barring a
few issues, some of which seem already fixed in 1.2.0 (like SPARK-2889), but
I have hit a roadblock with the following issue.
I have created a
Spark can do efficient joins if both RDDs have the same partitioner. So in
the case of a self-join, I would recommend creating an RDD that has an
explicit partitioner and has been cached.
On Dec 8, 2014 8:52 AM, Theodore Vasiloudis
theodoros.vasilou...@gmail.com wrote:
Hello all,
I am working on a
Hi Guys,
I used applySchema to store a set of nested dictionaries and lists in a
parquet file.
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sparkSQL-to-convert-a-collection-of-python-dictionary-of-dictionaries-to-schma-RDD-td20228.html#a20461
It was successful and I could
I am relatively new to Spark. I am planning to use Spark Streaming for my
OLAP use case, but I would like to know how RDDs are shared between multiple
workers.
If I need to constantly compute some stats on the streaming data, presumably
shared state would have to be updated serially by different
As a temporary fix, it works when I convert field six to a list manually. That
is:
def generateRecords(line):
    # input:  the row stored in the parquet file
    # output: a python dictionary with all the key-value pairs
    field1 = line.field1
    summary = {}
I do not see how you hope to generate all incoming edge pairs without
repartitioning the data by dstID. You need to perform this shuffle for
joining too. Otherwise two incoming edges could be in separate partitions
and never meet. Am I missing something?
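For intuition, the repartition-then-pair logic can be sketched on plain Python collections; the `(src, dst)` tuples below are a stand-in for the real edge type, not GraphX API:

```python
from collections import defaultdict
from itertools import combinations

# (src, dst) edge list; plain-Python stand-in for the RDD of edges.
edges = [(1, 10), (2, 10), (3, 10), (4, 20)]

# Key every edge by its destination (the first map in the discussion).
# In Spark, this keying is what forces the shuffle by dstID.
by_dst = defaultdict(list)
for src, dst in edges:
    by_dst[dst].append(src)

# Per destination, all unordered pairs of incoming edges; combinations()
# already excludes self-pairs and duplicates.
incoming_pairs = [
    (dst, pair)
    for dst, srcs in by_dst.items()
    for pair in combinations(sorted(srcs), 2)
]
```

The point of the thread holds here too: unless all edges with the same dst land in the same partition (the `by_dst` grouping), two incoming edges can never be paired.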
On Mon, Dec 8, 2014 at 3:53 PM, Theodore
Check your worker logs for the exact reason; if you let the job run for 2
days, then most likely you ran out of disk space or something similar.
Looking at the worker logs will give you a clear picture.
Thanks
Best Regards
On Mon, Dec 8, 2014 at 12:49 PM, Hafiz Mujadid
You can set up and customize Nagios for all these requirements. Or you can
use Ganglia if you are not looking for something with alerts (email etc.).
Thanks
Best Regards
On Mon, Dec 8, 2014 at 1:05 PM, Judy Nash judyn...@exchange.microsoft.com
wrote:
Hello,
Are there ways we can
How are you submitting the job? You need to create a jar of your code (sbt
package will give you one inside target/scala-*/projectname-*.jar) and then
use it while submitting. If you are not using spark-submit then you can
simply add this jar to spark by
I went through complex hierarchical JSON structures and Spark seems to fail in
querying them no matter what syntax is used.
Hope this helps,
Regards,
Alessandro
On Dec 8, 2014, at 6:05 AM, Raghavendra Pandey raghavendra.pan...@gmail.com
wrote:
Yeah, the dot notation works. It works even
@Daniel
It's true that the first map in your code is needed, i.e. mapping so that
dstID is the new RDD key.
The self-join on the dstKey will then create all the pairs of incoming
edges (plus self-referential and duplicates that need to be filtered out).
@Koert
Are there any guidelines about
On Mon, Dec 8, 2014 at 5:26 PM, Theodore Vasiloudis
theodoros.vasilou...@gmail.com wrote:
@Daniel
It's true that the first map in your code is needed, i.e. mapping so that
dstID is the new RDD key.
You wrote groupByKey is highly inefficient due to the need to shuffle all
the data, but you
My thesis is related to big data mining, and I have a cluster in the
laboratory of my university. My task is to install Apache Spark on it and
use it for extraction purposes. Is there any understandable guidance on how
to do this?
On a rough note:
Step 1: Install Hadoop 2.x on all the machines of the cluster.
Step 2: Check that the Hadoop cluster is working.
Step 3: Set up Apache Spark as described on the documentation page for the
cluster.
Check the status of the cluster on the master UI.
Since it is a data mining project, configure Hive too.
Hi,
I am intending to save the streaming data from Kafka into Cassandra,
using spark-streaming.
But there seems to be a problem with the line
javaFunctions(data).writerBuilder("testkeyspace", "test_table",
mapToRow(TestTable.class)).saveToCassandra();
I am getting 2 errors.
the code, the error-log and
This is fixed in 1.2. Also, in 1.2+ you could call row.asDict() to
convert the Row object into dict.
On Mon, Dec 8, 2014 at 6:38 AM, sahanbull sa...@skimlinks.com wrote:
Hi Guys,
I used applySchema to store a set of nested dictionaries and lists in a
parquet file.
@Daniel
Not an expert either, I'm just going by what I see performance-wise
currently. Our groupByKey implementation was more than an order of
magnitude slower than using the self join and then reduceByKey.
FTA:
*pairs on the same machine with the same key are combined (by using the
lambda
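The quoted map-side combine can be mimicked on plain collections; each inner list below stands in for one partition, and this only illustrates reduceByKey's two combine steps, not Spark internals:

```python
# Sketch of reduceByKey's map-side combine on plain collections.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5)],
]
combine = lambda x, y: x + y

def combine_by_key(pairs):
    out = {}
    for k, v in pairs:
        out[k] = combine(out[k], v) if k in out else v
    return out

# Step 1: combine locally inside each "machine" (partition), before any
# shuffle. This is the part groupByKey skips, which is why it ships all
# raw pairs across the network.
locally_combined = [combine_by_key(p) for p in partitions]

# Step 2: merge the per-partition results (the post-shuffle reduce).
merged = combine_by_key(kv for part in locally_combined for kv in part.items())
```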
Hi,
https://github.com/databricks/spark-perf/tree/master/streaming-tests/src/main/scala/streaming/perf
contains some performance tests for streaming. There are examples of how to
generate synthetic files during the test in that repo, maybe you
can find some code snippets that you can use there.
Hello Everyone,
I'm brand new to Spark and was wondering if there's a JDBC driver to access
Spark SQL directly. I'm running Spark in standalone mode and don't have
Hadoop in this environment.
--
*Best Regards/أطيب المنى,*
*Anas Mosaad*
You can use the Thrift server for this purpose and then test it with Beeline.
See doc:
https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
From: Anas Mosaad [mailto:anas.mos...@incorta.com]
Sent: Monday, December 8, 2014 11:01 AM
To: user@spark.apache.org
This is by Hive's design. From the Hive documentation:
The column change command will only modify Hive's metadata, and will not
modify data. Users should make sure the actual data layout of the
table/partition conforms with the metadata definition.
On Sat, Dec 6, 2014 at 8:28 PM, Jianshi
Update:
The issue in my previous post was solved:
I had to change the sbt file name from project_name.sbt to build.sbt.
-
Thanks!
-Caron
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/scopt-OptionParser-tp8436p20581.html
Hey Tobias,
Can you try using the YARN Fair Scheduler and set
yarn.scheduler.fair.continuous-scheduling-enabled to true?
-Sandy
On Sun, Dec 7, 2014 at 5:39 PM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
thanks for your responses!
On Sat, Dec 6, 2014 at 4:22 AM, Sandy Ryza
Hello Jianshi,
You meant you want to convert a Map to a Struct, right? We can extract some
useful functions from JsonRDD.scala, so others can access them.
Thanks,
Yin
On Mon, Dec 8, 2014 at 1:29 AM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
I checked the source code for inferSchema. Looks
Hi,
I think you have the right idea. I would not even worry about flatMap.
val rdd = sc.parallelize(1 to 100, numSlices = 1000).map(x =>
generateRandomObject(x))
Then when you try to evaluate something on this RDD, it will happen
partition-by-partition. So 1000 random objects will be
OK, have waded into implementing this and have gotten pretty far, but am now
hitting something I don't understand, a NoSuchMethodError.
The code looks like
[...]
val conf = new SparkConf().setAppName(appName)
//conf.set("fs.default.name", "file://");
val sc = new
Hi All,
I was able to run LinearRegressionWithSGD for a larger dataset (2GB sparse).
I have now filtered the data and I am running regression on a subset of it (~
200 MB). I see this error, which is strange since it was running fine with the
superset data. Is this a formatting issue
You just need to use the latest master code without any configuration
to get performance improvement from my PR.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
On Mon, Dec 8, 2014 at 7:53
Looks good, but how do I say that in Java?
As far as I can see, sc.parallelize (in Java) has only one implementation,
which takes a List, requiring an in-memory representation.
On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
Hi,
I think you have the right
You can call .schema on SchemaRDDs. For example:
results.schema.fields.map(_.name)
On Sun, Dec 7, 2014 at 11:36 PM, abhishek reachabhishe...@gmail.com wrote:
Hi,
I have iplRDD which is a json, and I do below steps and query through
hivecontext. I get the results but without columns
Hi,
I need to generate some flags based on certain columns and add them back to
the SchemaRDD for further operations. Do I have to use case classes
(reflection) or build the schema programmatically? I am using parquet files,
so the schema is being automatically derived. This is a great feature, thanks to Spark
Hi Jake,
The toString method should print the full model in versions 1.1.x.
The current master branch has a method toDebugString for
DecisionTreeModel which should print out all the node classes and the
toString method has been updated to print the summary only so there is a
slight change in the
Hi,
Is there any way to auto-calculate the optimum learning rate or step size via
MLlib for SGD?
Thx
tri
At 2014-12-08 12:12:16 -0800, spr s...@yarcdata.com wrote:
OK, have waded into implementing this and have gotten pretty far, but am now
hitting something I don't understand, a NoSuchMethodError.
[...]
The (short) traceback looks like
Exception in thread "main" java.lang.NoSuchMethodError:
Hi Bui,
Please use BFGS-based solvers... For BFGS you don't have to specify a step
size, since the line search will find sufficient decrease each time...
For regularization you still have to do a grid search... it's not possible to
automate that, but on master you will find nice ways to automate grid
I am trying to create a CSV file. I have managed to create the actual string
I want to output to a file, but when I try to write the file, I get the
following error:
saveAsTextFile is not a member of String
My string is perfect, when I call this line to actually save the file, I get
the above
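That error is expected: saveAsTextFile is a method on RDDs, not on String. You can either wrap the string in an RDD (in Scala, something like sc.parallelize(Seq(csvString)).saveAsTextFile(dir)), or, since the whole CSV already lives in one driver-side string, just write it as an ordinary local file. A sketch of the local-write idea (file name is made up), shown in Python here:

```python
import os
import tempfile

# The CSV already exists as one in-memory string on the driver, so a
# plain local write avoids Spark entirely.
csv = "id,value\n1,foo\n2,bar\n"
path = os.path.join(tempfile.gettempdir(), "out.csv")
with open(path, "w") as f:
    f.write(csv)

# Read it back to confirm the contents round-trip.
with open(path) as f:
    read_back = f.read()
```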
Running JavaAPISuite (master branch) on Linux, I got:
testGuavaOptional(org.apache.spark.JavaAPISuite) Time elapsed: 32.945 sec
ERROR!
org.apache.spark.SparkException: Job aborted due to stage failure: Master
removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org
Yeah, spark has very little overhead per partition, so generally more
partitions is better.
On Mon, Dec 8, 2014 at 1:46 PM, Theodore Vasiloudis
theodoros.vasilou...@gmail.com wrote:
@Daniel
Not an expert either, I'm just going by what I see performance-wise
currently. Our groupByKey
Assume I don't care about values which may be created in a later map - in
Scala I can say
val rdd = sc.parallelize(1 to 10, numSlices = 1000)
but in Java JavaSparkContext can only parallelize a List - limited to
Integer.MAX_VALUE elements and required to exist in memory - the best I can
Hi -
Does anybody have any ideas how to dynamically allocate cores instead of
statically partitioning them among multiple applications? Thanks.
Mohammed
From: Mohammed Guller
Sent: Friday, December 5, 2014 11:26 PM
To: user@spark.apache.org
Subject: Fair scheduling across applications in
Hi,
I'm wondering whether there is an efficient way to continuously append
new data to a registered spark SQL table.
This is what I want: I want to make an ad-hoc query service for a
JSON-formatted system log. Certainly, the system log is continuously generated.
I will use
I'm using CDH 5.1.0 and Spark 1.0.0, and I'd like to write out data
as snappy-compressed files but encountered a problem.
My code is as follows:
val InputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/text/new.txt"
val OutputTextFilePath = "hdfs://ec2.hadoop.com:8020/xt/compressedText/"
val
Steve, something like this will do, I think:
sc.parallelize(1 to 1000, 1000).flatMap(x => 1 to 10)
The above will launch 1000 tasks (maps), with each task creating 10^5
numbers (a total of 100 million elements).
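A scaled-down, cluster-free way to see the laziness being relied on here: generators, like the per-partition iterators inside flatMap, produce elements on demand instead of materializing a giant list up front (sizes shrunk from the thread's 1000 x 10^5):

```python
# Scaled-down analogue of sc.parallelize(1 to 1000, 1000).flatMap(...):
# nothing is held in memory; elements are produced and counted on the fly.
lazy_elems = (y for x in range(100) for y in range(1000))
total = sum(1 for _ in lazy_elems)   # 100 * 1000 elements, never a full list
```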
On Mon, Dec 8, 2014 at 6:17 PM, Steve Lewis lordjoe2...@gmail.com wrote:
assume
I have a question, as the title says; the question link is
http://stackoverflow.com/questions/27370170/query-classification-using-apache-spark-mlib
Thanks,
Jin
Pretty straightforward: Using Scala, I have an RDD that represents a table
with four columns. What is the recommended way to convert the entire RDD to
one JSON object?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-convert-RDD-to-JSON-tp20585.html
Hi,
On Tue, Dec 9, 2014 at 4:39 AM, Sandy Ryza sandy.r...@cloudera.com wrote:
Can you try using the YARN Fair Scheduler and set
yarn.scheduler.fair.continuous-scheduling-enabled to true?
I'm using Cloudera 5.2.0 and my configuration says
yarn.resourcemanager.scheduler.class =
Ah... I see. Thanks for pointing it out.
Then it means we cannot mount an external table using customized column names.
Hmm...
Then the only option left is to use a subquery to add a bunch of column
alias. I'll try it later.
Thanks,
Jianshi
On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust
You don't need to worry about locks as such, since one thread/worker is
exclusively responsible for one partition of the RDD. You can use the
Accumulator variables that Spark provides to get the state updates.
On Mon Dec 08 2014 at 8:14:28 PM aditya.athalye adbrihadarany...@gmail.com
wrote:
I am
Hi Experts!
I want to save a DStream to HDFS only if it is not empty, i.e. it contains
some Kafka messages to be stored. What is an efficient way to do this?
var data = KafkaUtils.createStream[Array[Byte], Array[Byte],
DefaultDecoder, DefaultDecoder](ssc, params, topicMap,
I am facing a somewhat confusing problem:
My spark app reads data from a database, calculates certain values and then
runs a shortest path Pregel operation on them. If I save the RDD to disk and
then read the information out again, my app runs between 30-50% faster than
keeping it in memory, plus
Hi all,
In my Spark application, I load a CSV file and map the data to a Map
variable for later use on the driver node, then broadcast it. Everything
works fine until the exception java.io.FileNotFoundException occurs. The
console log shows me the broadcast is unavailable. I googled this
You cannot pass the sc object (val b = Utils.load(sc, ip_lib_path)) inside
a map function, and that's why the serialization exception is popping up
(since sc is not serializable). You can try Tachyon's cache if you want to
persist the data in memory more or less forever.
Thanks
Best Regards
On Tue,
Hi yuemeng,
Are you possibly running the Capacity Scheduler with the default resource
calculator?
-Sandy
On Sat, Dec 6, 2014 at 7:29 PM, yuemeng1 yueme...@huawei.com wrote:
Hi, all
When I run an app with this command: ./bin/spark-sql --master
yarn-client --num-executors 2
Another thing to be aware of is that YARN will round up containers to the
nearest increment of yarn.scheduler.minimum-allocation-mb, which defaults
to 1024.
-Sandy
On Sat, Dec 6, 2014 at 3:48 PM, Denny Lee denny.g@gmail.com wrote:
Got it - thanks!
On Sat, Dec 6, 2014 at 14:56 Arun Ahuja
Hi All,
I am facing following problem on Spark-1.2 rc1 where I get Treenode
exception (unresolved attributes) :-
https://issues.apache.org/jira/browse/SPARK-2063
To avoid this, I do something following :-
val newSchemaRDD = sqlContext.applySchema(existingSchemaRDD,
existingSchemaRDD.schema)
I am hitting the same issue. Any solution?
On Wed, Nov 12, 2014 at 2:54 PM, tridib tridib.sama...@live.com wrote:
Hi Friends,
I am trying to save a JSON file to parquet. I got the error "Unsupported
datatype TimestampType".
Does parquet not support dates? Which parquet version does Spark use? Is
there
An RDD is just a wrapper over a Scala collection. Maybe you can use the
.collect() method to get the underlying Scala collection; you can then convert
it to a JSON object using plain Scala.
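As a sketch of that suggestion (shown in Python for brevity), assuming the four columns hold simple types; the column names and rows below are made up:

```python
import json

# Made-up column names and rows; roughly what collect() would hand back
# for a four-column table.
columns = ["id", "name", "score", "active"]
collected = [
    (1, "alice", 9.5, True),
    (2, "bob", 7.0, False),
]

# One JSON object per row, keyed by column name, then the whole table as
# a single JSON array.
as_json = json.dumps([dict(zip(columns, row)) for row in collected])
```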
Hi all,
I was running HiveFromSpark on yarn-cluster. When I got the hive select's
result SchemaRDD and tried to run `collect()` on it, the application got
stuck, and I don't know what's wrong with it. Here is my code:
val sqlStat = s"SELECT * FROM $TABLE_NAME"
val result = hiveContext.hql(sqlStat)