Hi All
I have a use case where I am consuming events from RabbitMQ using
Spark Streaming. Each event has some fields on which I want to query
PostgreSQL, bring back that data, then join the event data with the
PostgreSQL data and put the aggregated data into HDFS, so that I run
YARN and Mesos are better for production clusters of non-trivial size
that have mixed job kinds and multiple users, as they manage resources more
intelligently and dynamically. They also support other services you
probably need, like HDFS, databases, workflow tools, etc.
Standalone is fine,
Bagavath,
Sometimes we need to merge existing records due to recomputations over the
whole dataset. I don't think we can achieve this with a pure insert, or is
there a way?
On 24 July 2015 at 08:53, Bagavath bagav...@gmail.com wrote:
Try using insert instead of merge. Typically we use insert
Hello,
I've got a problem using Spark with Geomesa. I'm not quite sure where the
error comes from, but I assume it's a problem with Spark.
A ClassNotFoundException is thrown with the following content: Failed to
register classes with Kryo.
Please have a look at https://github.com/apache/spark/pull/4258
Hi all,
I wonder if any one has an explanation for this behavior.
Thank you,
-Khaled
-- Forwarded message --
From: Khaled Ammar khaled.am...@gmail.com
Date: Fri, Jul 24, 2015 at 9:35 AM
Subject: Performance questions regarding Spark 1.3 standalone mode
To: user@spark.apache.org
Hi Saif:
There is also the Spark Kernel, which provides auto-complete,
logs, and syntax highlighting for Scala in the notebook (e.g. Jupyter):
https://github.com/ibm-et/spark-kernel
There was a recent meetup that talked about it in case you are
interested in the technical details:
You can have Spark read from PostgreSQL through the data access API. Do you
have any concerns with that approach, since you mention copying that data into
HBase?
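For example, a minimal sketch of reading a PostgreSQL table through the JDBC
data source, assuming Spark 1.4+ where sqlContext.read is available (the URL,
table name, and credentials below are hypothetical placeholders):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Load a PostgreSQL table as a DataFrame via the JDBC data source.
val pgLookup = sqlContext.read.format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")
  .option("dbtable", "lookup_table")
  .option("user", "spark")
  .option("password", "secret")
  .load()

// Register it so it can be joined with the event data in Spark SQL.
pgLookup.registerTempTable("lookup_table")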
From: Jeetendra Gangele
Sent: Monday, July 27, 6:00 AM
Subject: Data from PostgreSQL to Spark
To: user
Hi All
I have a
You refer to `aws_security_token`, but I'm not sure where you're specifying
it. Can you elaborate? Is it an environment variable?
On Mon, Jul 27, 2015 at 4:21 AM Jan Zikeš jan.zi...@centrum.cz wrote:
Hi,
I would like to ask if it is currently possible to use spark-ec2 script
together with
Hello,
I am trying to start a Spark master for a standalone cluster on an EC2 node.
The CLI command I'm using looks like this:
Note that I'm specifying the --host argument; I want my Spark master to be
listening on a specific IP address. The host that I'm specifying (i.e.
54.xx.xx.xx) is the
Any updates on this bug ?
Why do the Spark logs report final job statuses that do not match (one saying
that the job has failed, another stating that the job has succeeded)?
Thanks.
On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov elkhan8...@gmail.com
wrote:
Hi all,
While running Spark Word count python
Does SparkR support all the algorithms that the R libraries support?
Hmm, it should work when you run `PYSPARK_DRIVER_PYTHON=ipython
spark/bin/pyspark`.
PYTHONSTARTUP is a Python environment variable:
https://docs.python.org/2/using/cmdline.html#envvar-PYTHONSTARTUP
On Sun, Jul 26, 2015 at 4:06 PM -0700, Zerony Zhao bw.li...@gmail.com wrote:
Hello everyone,
Hi spark users and developers,
I have been trying to understand how Spark SQL works with Parquet for the
last couple of days. There is an unexpected performance problem when using
column pruning. Here is a dummy example:
The Parquet file has 3 fields:
|-- customer_id: string (nullable =
bq. on one node it works but on the other it gives me the above error.
Can you tell us the difference between the environments on the two nodes ?
Does the other node use Java 8 ?
Cheers
On Mon, Jul 27, 2015 at 11:38 AM, Rahul Palamuttam rahulpala...@gmail.com
wrote:
Hi All,
I hope this is
So just to clarify, I have 4 nodes, all of which use Java 8.
Only one of them is able to successfully execute the build/sbt assembly
command.
However on the 3 others I get the error.
If I run sbt assembly in Spark Home, it works and I'm able to launch the
master and worker processes.
On Mon, Jul
All nodes are using java 8.
I've tried to mimic the environments as much as possible among all nodes.
On Mon, Jul 27, 2015 at 11:44 AM, Ted Yu yuzhih...@gmail.com wrote:
bq. on one node it works but on the other it gives me the above error.
Can you tell us the difference between the
Hi, Akhil,
Yes, in build.sbt I had wrongly set it to the Scala version installed on the
cluster (2.11.6); it is fixed now. Thanks!
Cheers,
Dan
2015-07-27 2:29 GMT-05:00 Akhil Das ak...@sigmoidanalytics.com:
What's in your build.sbt? It seems you could be messing with the Scala
version.
Hi all,
would like some insight. I am currently computing over huge databases, and playing
with monitoring and tuning.
When monitoring the multiple cores I have, I see that even when RDDs are
parallelized, computation on an RDD jumps from core to core sporadically (I
guess, depending on where
Hi All,
I hope this is the right place to post troubleshooting questions.
I've been following the install instructions and I get the following error
when running the following from the Spark home directory:
$./build/sbt
Using /usr/java/jdk1.8.0_20/ as default JAVA_HOME.
Note, this will be overridden
Thanks for your reply.
In parallel I will be hitting around 6000 calls to PostgreSQL, which is not good;
my database will die.
These calls to the database will keep on increasing.
Handling millions of requests is not an issue with HBase/NoSQL.
Any other alternative?
On 27 July 2015 at 23:18,
You can open the DB connection once per partition. Please have a look at the
design patterns for the foreach construct in the documentation.
How big is your data in the DB? How often does that data change? You would be
better off if the data were already in Spark.
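For example, a rough sketch of that per-partition pattern, assuming an RDD of
event ids (eventIds) and a hypothetical lookup table; the connection details
are placeholders:

import java.sql.DriverManager

eventIds.foreachPartition { partition =>
  // One JDBC connection per partition instead of one per record.
  val conn = DriverManager.getConnection(
    "jdbc:postgresql://db-host:5432/mydb", "spark", "secret")
  val stmt = conn.prepareStatement("SELECT name FROM lookup WHERE id = ?")
  try {
    partition.foreach { id =>
      stmt.setString(1, id)
      val rs = stmt.executeQuery()
      while (rs.next()) {
        // combine rs.getString("name") with the event here
      }
      rs.close()
    }
  } finally {
    stmt.close()
    conn.close()
  }
}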
On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:
I can't migrate this PostgreSQL data since lots of systems are using it, but I can
copy this data to some NoSQL store like HBase and query HBase. The issue here is:
how can I make sure that HBase has up-to-date data?
Is velocity an issue in Postgres, such that your data would become stale as soon as
it
Hi,
I am running PySpark on Windows and I am seeing an error while adding
pyFiles to the SparkContext. Below is the example:
sc = SparkContext("local", "Sample", pyFiles="C:/sample/yattag.zip")
This fails with a file-not-found error for "C".
The logic below is treating the path as individual files like "C",
Thank you so much.
I found the issue. My fault: the stock IPython version 0.12.1 is too old
and does not support PYTHONSTARTUP. Upgrading IPython solved the issue.
On Mon, Jul 27, 2015 at 12:43 PM, felixcheun...@hotmail.com wrote:
Hmm, it should work when you run
It expects an iterable, and if you iterate over a string, you get the
individual characters. Use a list instead:
pyFiles=['/path/to/file']
On Mon, Jul 27, 2015 at 2:40 PM, Naveen Madhire vmadh...@umail.iu.edu
wrote:
Hi,
I am running pyspark in windows and I am seeing an error while adding
I'm a bit confused about the documentation in the area of Hive support.
I want to use a remote Hive metastore/hdfs server and the documentation says
that we need to build Spark from source due to the large number of
dependencies Hive requires.
Specifically the documentation says:
Hive has a
Probably relevant to people on this list: on Friday I released a clone of
the Spark web UI built using Meteor https://www.meteor.com/ so that
everything updates in real-time, saving you from endlessly refreshing the
page while jobs are running :) It can also serve as the UI for running as
well as
Hi,
I am using Spark 1.3 (CDH 5.4.4). What's the recipe for setting a minimum
output file size when writing out from SparkSQL? So far, I have tried:
--x-
import sqlContext.implicits._
sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
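Another approach would presumably be to reduce the number of output partitions
before writing, so each output file ends up larger; a sketch, assuming a
DataFrame named df and an illustrative partition count:

// Fewer partitions at write time -> fewer, larger output files.
df.repartition(8).saveAsParquetFile("/path/to/output")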
Simply no. Currently SparkR is the R API for Spark DataFrames; no existing
algorithms can benefit from it unless they are rewritten on top of that
API.
There is on-going development on supporting MLlib and ML Pipelines in SparkR:
https://issues.apache.org/jira/browse/SPARK-6805
From: Mohit
Hello everyone,
Another newbie question.
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark runs fine, (in $SPARK_HOME)
Python 2.7.10 (default, Jul 3 2015, 01:26:20)
Type copyright, credits or license for more information.
IPython 3.2.1 -- An enhanced Interactive Python.
? - Introduction and
Hi.
Have you ruled out that this may just be I/O time?
Word count is a very lightweight task for the CPU, but you will need
to read the initial data from whatever storage device you have your HDFS
running on.
As you have 3 machines, 22 cores each, but perhaps just one or a few HDD /
SSD /
Hello all,
I am currently getting an error with Spark SQL accessing Elasticsearch using the
Elasticsearch Spark integration. Below is the series of commands I issued,
along with the stack trace. I am unclear what the error could mean. I can
print the schema correctly but error out if I try to display a
Hi,
What is the proper JSON parsing library to use in Spark Streaming? Currently
I am trying to use the Gson library in a Java class and call the Java method
from a Scala class, as shown below. What are the advantages of using Json4s
as against using the Gson library in a Java class and calling it from
json4s is used by https://github.com/hammerlab/spark-json-relay
See the other thread on 'Spree'
FYI
On Mon, Jul 27, 2015 at 6:07 PM, swetha swethakasire...@gmail.com wrote:
Hi,
What is the proper Json parsing library to use in Spark Streaming?
Currently
I am trying to use Gson library in
No, with s3a I have the following error:
java.lang.NoSuchMethodError:
com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:285)
2015-07-27 11:17 GMT+02:00 Akhil Das
That error is a jar conflict; you must have multiple versions of the
Hadoop jar in the classpath. First make sure you are able to access
your AWS S3 with s3a, then set the endpoint configuration and try to
access the custom storage.
Thanks
Best Regards
On Mon, Jul 27, 2015 at 4:02 PM,
Internally I believe that we only actually create one struct object for
each row, so you are really only paying the cost of the pointer in most use
cases (as shown below).
scala> val df = Seq((1,2), (3,4)).toDF("a", "b")
df: org.apache.spark.sql.DataFrame = [a: int, b: int]
scala> df.collect()
res1:
Hi Yan,
Is it possible to access the hbase table through spark sql jdbc layer ?
Thanks.
Deb
On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote:
Yes, but not all SQL-standard insert variants.
*From:* Debasish Das [mailto:debasish.da...@gmail.com]
*Sent:* Wednesday, July
When using spark-submit: which directory contains third-party libraries
that will be loaded on each of the slaves? I would like to scp one or more
libraries to each of the slaves instead of shipping the contents in the
application uber-jar.
Note: I did try adding to $SPARK_HOME/lib_managed/jars.
I'm currently using Spark 1.4 in standalone mode.
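Presumably an alternative would be to copy the jars to the same path on every
slave and point the executor classpath at them; a sketch (the path and jar
name are hypothetical placeholders):

import org.apache.spark.SparkConf

// The jars would already have been scp'd to this path on every slave.
val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "/opt/extra-jars/mylib.jar")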
I've forked the Apache Hive branch from https://github.com/pwendell/hive
and customised it in the following way:
I added a thread-local variable in the SessionManager class, and I'm setting the
session variable in my
Hi,
I got an error when running Spark Streaming, as below.
java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at
Hi all,
I am running Spark 1.4.1 on Mesos 0.23.0.
I am able to start spark-shell on the node where mesos-master is running, and
it works fine. But when I try to start spark-shell on the mesos-slave nodes,
I encounter this error. I greatly appreciate any help.
15/07/27 22:14:44 INFO Utils:
Hi, Spark Users
Looks like Spark 1.4.0 cannot work with Cygwin due to the removal of
Cygwin support in bin/spark-class.
The changeset is
https://github.com/apache/spark/commit/517975d89d40a77c7186f488547eed11f79c1e97#diff-fdf4d3e600042c63ffa17b692c4372a3
The changeset said Add a library for
Hi Jerry,
Thanks for the detailed report! I haven't investigated this issue in
detail. But for the input size issue, I believe this is due to a
limitation of the HDFS API. It seems that the Hadoop FileSystem adds the size of
a whole block to the metrics even if you only touch a fraction of that
Elkhan,
What does the ResourceManager say about the final status of the job? Spark
jobs that run as Yarn applications can fail but still successfully clean up
their resources and give them back to the Yarn cluster. Because of this,
there's a difference between your code throwing an exception in
Hi all,
Spark SQL usually creates a DataFrame with GenericRowWithSchema (is that
right?). And Row is a superclass of GenericRow and GenericRowWithSchema.
The only difference is that GenericRowWithSchema carries its schema information
as a StructType. But I think one DataFrame has only one schema, so then
Hello,
I would like to add a column of StructType to a DataFrame.
What would be the best way to do it? I am not sure if it is possible using
withColumn. A possible way is to convert the DataFrame into an RDD[Row], add
the struct, and then convert it back to a DataFrame. But that seems like
overkill.
Please
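For example, a hedged sketch of the withColumn route I have in mind, assuming
Spark 1.4+ where functions.struct is available (df and the column names are
hypothetical placeholders):

import org.apache.spark.sql.functions.struct

// Build a struct-typed column from existing columns, without an RDD[Row] round trip.
val withAddress = df.withColumn("address", struct(df("street"), df("city")))
withAddress.printSchema()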
Hi,
May I know how to use the functions mentioned in
http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$
in Spark SQL?
When I use something like
Select last(column) from tablename, I am getting an error like
15/07/27 03:00:00 INFO exec.FunctionRegistry: Unable to lookup
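For reference, the DataFrame-API equivalent I would expect to work looks
roughly like this (the table and column names are placeholders):

import org.apache.spark.sql.functions.last

val df = sqlContext.table("sampleTable")
df.agg(last("product")).show()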
Hi,
I would like to ask if it is currently possible to use the spark-ec2 script
with credentials that consist not only of
aws_access_key_id and aws_secret_access_key, but also contain
aws_security_token.
When I try to run the script I am getting the following error message:
How about IntelliJ? It also has a Terminal tab.
Thanks
Best Regards
On Fri, Jul 24, 2015 at 6:06 PM, saif.a.ell...@wellsfargo.com wrote:
Hi all,
I tried the Zeppelin notebook (incubating), but I am not completely happy with it.
What do you people use for coding? Anything with auto-complete,
Do you mean something like this?
val values = rdd.mapPartitions { i: Iterator[Future[T]] =>
  val future: Future[Iterator[T]] = Future.sequence(i)
  Await.result(future, someTimeout)
}
Where is the blocking happening in this case? It seems to me that all the
workers will be blocked until the
Have a look at the current security support
(https://spark.apache.org/docs/latest/security.html); Spark does not have
any encryption support for objects in memory out of the box. But if your
concern is to protect the data being cached in memory, then you can easily
encrypt your objects in memory
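For instance, a rough sketch of encrypting each record yourself before caching
(the key, IV, and the records RDD are hypothetical placeholders, and key
management is out of scope):

import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import javax.crypto.Cipher
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

// 16-byte key and IV for AES; real keys should come from a key store.
val keyBytes = "0123456789abcdef".getBytes("UTF-8")
val ivBytes  = "fedcba9876543210".getBytes("UTF-8")

// Serialize each (serializable) record and AES-encrypt the bytes before caching.
val encrypted = records.map { record =>
  val buffer = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buffer)
  out.writeObject(record)
  out.close()
  val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
  cipher.init(Cipher.ENCRYPT_MODE,
    new SecretKeySpec(keyBytes, "AES"), new IvParameterSpec(ivBytes))
  cipher.doFinal(buffer.toByteArray)
}.cache()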
Hi there,
I tested with sqlContext.sql("select funcName(param1, param2, ...) from tableName")
and it just worked fine.
Would you like to paste your test code here? And which version of Spark are you
using?
Best,
Sun.
fightf...@163.com
From: vinod kumar
Date: 2015-07-27 15:04
To: User
Subject:
It's for one day of events, in the range of 1 billion, and the processing is in a
streaming application with a ~10-15 sec interval, so lookups should be fast. The
RDD needs to be updated with new events, and old events from more than 24 hours
before the current time should be removed at each processing step.
So is a Spark RDD not fit for this
You can follow this doc
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup
Thanks
Best Regards
On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com wrote:
Hi All,
I am trying to setup the Eclipse (LUNA) with Maven so that I
An RDD is immutable; it cannot be changed, you can only create a new one from
data or from a transformation. It sounds inefficient to create one every 15
seconds covering the last 24 hours.
I think a key-value store would be a much better fit for this purpose.
On Mon, Jul 27, 2015 at 11:21 AM Shushant Arora
It's a serialization error with the nested schema, I guess. You can look at
Twitter's chill-avro serializer library. Here are two discussions on the same:
- https://issues.apache.org/jira/browse/SPARK-3447
-
What's in your build.sbt? It seems you could be messing with the Scala
version.
Thanks
Best Regards
On Fri, Jul 24, 2015 at 2:15 AM, Dan Dong dongda...@gmail.com wrote:
Hi,
When I ran the following simple Spark program with spark-submit:
import org.apache.spark.SparkContext._
import
Hi all: I am testing the performance of Hive on Spark SQL. The existing table is
created with ROW FORMAT SERDE
'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES (
'input.regex' =
What is the throughput of processing, and for how long do you need to remember
duplicates?
You can take all the events, put them in an RDD, group by the key, and then
process each key only once.
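For example, a minimal sketch of that (the pair RDD events and the process
function are hypothetical placeholders):

// Keep one event per key, then handle each key exactly once.
val deduped = events.reduceByKey((first, _) => first)
deduped.foreach { case (key, event) => process(key, event) }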
But if you have a long-running application where you want to check that you
didn't see the same value
The spark.shuffle.sort.bypassMergeThreshold setting might help. You could also try
setting the shuffle manager to hash instead of sort. You can see more
configuration options here:
https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior.
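For instance, a sketch of setting those two options on the SparkConf (the
threshold value below is only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.manager", "hash")                   // default is "sort"
  .set("spark.shuffle.sort.bypassMergeThreshold", "400")  // default is 200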
Thanks
Best Regards
On Fri, Jul 24, 2015 at 3:33
Hi,
I would like to ask if it is currently possible to use the spark-ec2 script
with credentials that consist not only of aws_access_key_id
and aws_secret_access_key, but also contain aws_security_token.
When I try to run the script I am getting the following error message:
In this case, each partition will block until the futures in that partition
are completed.
If you are in the end collecting all the Futures to the driver, what is the
reasoning behind using an RDD? You could just use a bunch of Futures
directly.
If you want to do some processing on the results
Hi,
Select last(product) from sampleTable
Spark Version 1.3
-Vinod
On Mon, Jul 27, 2015 at 3:48 AM, fightf...@163.com fightf...@163.com
wrote:
Hi, there
I test with sqlContext.sql(select funcName(param1,param2,...) from
tableName ) just worked fine.
Would you like to paste your test
For each of your jobs, you can pass spark.ui.port to bind to a different
port.
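For example (4041 is just an arbitrary free port):

import org.apache.spark.SparkConf

// Each concurrently running application gets its own UI port.
val conf = new SparkConf().set("spark.ui.port", "4041")
// or on the command line: spark-submit --conf spark.ui.port=4041 ...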
Thanks
Best Regards
On Fri, Jul 24, 2015 at 7:49 PM, Joji John jj...@ebates.com wrote:
Thanks Ajay.
The way we wrote our Spark application is that we have generic Python
code, multiple instances of which can
So you are able to access your AWS S3 with s3a now? What is the error that
you are getting when you try to access the custom storage with
fs.s3a.endpoint?
Thanks
Best Regards
On Mon, Jul 27, 2015 at 2:44 PM, Schmirr Wurst schmirrwu...@gmail.com
wrote:
I was able to access Amazon S3, but for
I am implementing word count on a Spark cluster (1 master, 3 slaves) in
standalone mode. I have 546 GB of data, and the dfs.blocksize I set is 256 MB.
Therefore, the number of tasks is 2186. My 3 slaves each use 22 cores and
72 memory to do the processing, so the computing ability of each slave
Hi,
Do you mean you are running the script with
https://github.com/amplab-extras/SparkR-pkg and Spark 1.2? I am afraid that
there is currently no development effort or support for SparkR-pkg, since it
has been integrated into Spark as of Spark 1.4.
Unfortunately, the RDD API and RDD-like
I was able to access Amazon S3, but for some reason the endpoint
parameter is ignored, and I'm not able to access the storage from my
provider... :
sc.hadoopConfiguration.set("fs.s3a.endpoint", "test")
sc.hadoopConfiguration.set(fs.s3a.awsAccessKeyId,)