Data from PostgreSQL to Spark

2015-07-27 Thread Jeetendra Gangele
Hi All, I have a use case where I am consuming events from RabbitMQ using Spark Streaming. This event has some fields on which I want to query PostgreSQL, bring the data back, then do a join between the event data and the PostgreSQL data and put the aggregated data into HDFS, so that I run

Re: Comparison between Standalone mode and YARN mode

2015-07-27 Thread Dean Wampler
YARN and Mesos are better for production clusters of non-trivial size that have mixed job kinds and multiple users, as they manage resources more intelligently and dynamically. They also support other services you probably need, like HDFS, databases, workflow tools, etc. Standalone is fine,

Re: Performance issue with Spark's foreachPartition method

2015-07-27 Thread diplomatic Guru
Bagavath, Sometimes we need to merge existing records, due to recomputations of the whole data. I don't think we could achieve this with pure insert, or is there a way? On 24 July 2015 at 08:53, Bagavath bagav...@gmail.com wrote: Try using insert instead of merge. Typically we use insert

Spark - Serialization with Kryo

2015-07-27 Thread Pa Rö
Hello, I've got a problem using Spark with Geomesa. I'm not quite sure where the error comes from, but I assume it's a problem with Spark. A ClassNotFoundException is thrown with the following content: "Failed to register classes with Kryo". Please have a look at https://github.com/apache/spark/pull/4258
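A minimal sketch of registering application classes with Kryo up front, which is usually what this error is asking for; the MyEvent class below is a hypothetical stand-in for the actual Geomesa types, and those classes must also be in the application jar shipped to the executors:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical class standing in for the types that fail to register.
    case class MyEvent(id: String, value: Double)

    val conf = new SparkConf()
      .setAppName("kryo-example")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Register the classes explicitly so Kryo does not fail at runtime.
      .registerKryoClasses(Array(classOf[MyEvent]))
    val sc = new SparkContext(conf)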

Fwd: Performance questions regarding Spark 1.3 standalone mode

2015-07-27 Thread Khaled Ammar
Hi all, I wonder if any one has an explanation for this behavior. Thank you, -Khaled -- Forwarded message -- From: Khaled Ammar khaled.am...@gmail.com Date: Fri, Jul 24, 2015 at 9:35 AM Subject: Performance questions regarding Spark 1.3 standalone mode To: user@spark.apache.org

Re: suggest coding platform

2015-07-27 Thread Guillermo Cabrera
Hi Saif: There is also the Spark Kernel which provides you the auto-complete, logs and syntax highlighting for scala on the notebook (ex. jupyter) https://github.com/ibm-et/spark-kernel There was a recent meetup that talked about it in case you are interested in the technical details:

Re: Data from PostgreSQL to Spark

2015-07-27 Thread felixcheung_m
You can have Spark reading from PostgreSQL through the data access API. Do you have any concerns with that approach, since you mention copying that data into HBase? From: Jeetendra Gangele Sent: Monday, July 27, 6:00 AM Subject: Data from PostgreSQL to Spark To: user Hi All I have a
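A minimal sketch of what such a read might look like through the JDBC data source, assuming Spark 1.4 and placeholder connection details (sc is an existing SparkContext):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // The PostgreSQL JDBC driver jar must be on the driver and executor classpath.
    val pgData = sqlContext.read
      .format("jdbc")
      .options(Map(
        "url"     -> "jdbc:postgresql://db-host:5432/mydb?user=dbuser&password=dbpass",
        "dbtable" -> "public.lookup_table",   // table or subquery exposed as a DataFrame
        "driver"  -> "org.postgresql.Driver"))
      .load()

    // The resulting DataFrame can then be joined with the event data and written to HDFS.
    pgData.registerTempTable("lookup_table")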

Re: spark spark-ec2 credentials using aws_security_token

2015-07-27 Thread Nicholas Chammas
You refer to `aws_security_token`, but I'm not sure where you're specifying it. Can you elaborate? Is it an environment variable? On Mon, Jul 27, 2015 at 4:21 AM Jan Zikeš jan.zi...@centrum.cz wrote: Hi, I would like to ask if it is currently possible to use spark-ec2 script together with

Getting java.net.BindException when attempting to start Spark master on EC2 node with public IP

2015-07-27 Thread Wayne Song
Hello, I am trying to start a Spark master for a standalone cluster on an EC2 node. The CLI command I'm using looks like this: Note that I'm specifying the --host argument; I want my Spark master to be listening on a specific IP address. The host that I'm specifying (i.e. 54.xx.xx.xx) is the

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-27 Thread Elkhan Dadashov
Any updates on this bug? Why do the Spark logs report final job statuses that do not match (one saying that the job has failed, another stating that the job has succeeded)? Thanks. On Thu, Jul 23, 2015 at 4:43 PM, Elkhan Dadashov elkhan8...@gmail.com wrote: Hi all, While running Spark Word count python

SparkR

2015-07-27 Thread Mohit Anchlia
Does SparkR support all the algorithms that R library supports?

Re: PYSPARK_DRIVER_PYTHON=ipython spark/bin/pyspark Does not create SparkContext

2015-07-27 Thread felixcheung_m
Hmm, it should work when you run `PYSPARK_DRIVER_PYTHON=ipython spark/bin/pyspark`. PYTHONSTARTUP is a Python environment variable: https://docs.python.org/2/using/cmdline.html#envvar-PYTHONSTARTUP On Sun, Jul 26, 2015 at 4:06 PM -0700, Zerony Zhao bw.li...@gmail.com wrote: Hello everyone,

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
Hi Spark users and developers, I have been trying to understand how Spark SQL works with Parquet for the past couple of days. There is an unexpected performance problem when using column pruning. Here is a dummy example: the Parquet file has 3 fields: |-- customer_id: string (nullable =
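For reference, a minimal sketch of the column-pruning scenario being described; the path and column name are placeholders, not the poster's data:

    // Read the Parquet file and select a single column; with column pruning
    // only that column's data should need to be read from disk.
    val df = sqlContext.read.parquet("hdfs:///path/to/customers.parquet")
    df.select("customer_id").count()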

Re: Spark build/sbt assembly

2015-07-27 Thread Ted Yu
bq. on one node it works but on the other it gives me the above error. Can you tell us the difference between the environments on the two nodes ? Does the other node use Java 8 ? Cheers On Mon, Jul 27, 2015 at 11:38 AM, Rahul Palamuttam rahulpala...@gmail.com wrote: Hi All, I hope this is

Re: Spark build/sbt assembly

2015-07-27 Thread Rahul Palamuttam
So just to clarify, I have 4 nodes, all of which use Java 8. Only one of them is able to successfully execute the build/sbt assembly command. However on the 3 others I get the error. If I run sbt assembly in Spark Home, it works and I'm able to launch the master and worker processes. On Mon, Jul

Re: Spark build/sbt assembly

2015-07-27 Thread Rahul Palamuttam
All nodes are using java 8. I've tried to mimic the environments as much as possible among all nodes. On Mon, Jul 27, 2015 at 11:44 AM, Ted Yu yuzhih...@gmail.com wrote: bq. on one node it works but on the other it gives me the above error. Can you tell us the difference between the

Re: java.lang.NoSuchMethodError for list.toMap.

2015-07-27 Thread Dan Dong
Hi Akhil, yes, in build.sbt I had wrongly set it to the Scala version installed on the cluster (2.11.6); fixed now. Thanks! Cheers, Dan 2015-07-27 2:29 GMT-05:00 Akhil Das ak...@sigmoidanalytics.com: What's in your build.sbt? You could be messing up the Scala version, it seems.
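For reference, a minimal build.sbt sketch for a Spark 1.x application; the default published Spark 1.x artifacts are built against Scala 2.10, and the versions shown here are only illustrative:

    name := "my-spark-app"

    version := "0.1"

    scalaVersion := "2.10.4"

    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1" % "provided"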

CPU Parallelization not being used (local mode)

2015-07-27 Thread Saif.A.Ellafi
Hi all, I would like some insight. I am currently computing over huge databases and playing with monitoring and tuning. When monitoring the multiple cores I have, I see that even when RDDs are parallelized, computation on the RDD jumps from core to core sporadically (I guess, depending on where

Spark build/sbt assembly

2015-07-27 Thread Rahul Palamuttam
Hi All, I hope this is the right place to post troubleshooting questions. I've been following the install instructions and I get the following error when running the following from Spark home directory $./build/sbt Using /usr/java/jdk1.8.0_20/ as default JAVA_HOME. Note, this will be overridden

Re: Data from PostgreSQL to Spark

2015-07-27 Thread Jeetendra Gangele
Thanks for your reply. In parallel I will be hitting around 6000 calls to PostgreSQL, which is not good; my database will die. These calls to the database will keep on increasing. Handling millions of requests is not an issue with HBase/NoSQL. Any other alternative? On 27 July 2015 at 23:18,

Re: Data from PostgreSQL to Spark

2015-07-27 Thread ayan guha
You can open the DB connection once per partition. Please have a look at the design patterns for the foreach construct in the documentation. How big is your data in the DB? How often does that data change? You would be better off if the data were in Spark already On 28 Jul 2015 04:48, Jeetendra Gangele gangele...@gmail.com wrote:
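A sketch of the one-connection-per-partition pattern referred to above; rdd stands for the RDD of events, and the JDBC URL and per-record logic are placeholders:

    rdd.foreachPartition { partition =>
      // Open a single connection for the whole partition rather than one per record.
      val conn = java.sql.DriverManager.getConnection(
        "jdbc:postgresql://db-host:5432/mydb", "dbuser", "dbpass")
      try {
        partition.foreach { record =>
          // issue the lookup or upsert for this record over the shared connection
        }
      } finally {
        conn.close() // release the connection once the partition is done
      }
    }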

Re: Data from PostgreSQL to Spark

2015-07-27 Thread santoshv98
I can't migrate this PostgreSQL data since lots of systems are using it, but I can take this data to some NoSQL store like HBase and query HBase; but the issue here is how can I make sure that HBase has up-to-date data? Is velocity an issue in Postgres, i.e. would your data become stale as soon as it

pyspark issue

2015-07-27 Thread Naveen Madhire
Hi, I am running pyspark on Windows and I am seeing an error while adding pyFiles to the SparkContext. Below is the example: sc = SparkContext("local", "Sample", pyFiles="C:/sample/yattag.zip") This fails with a file-not-found error for "C". The underlying logic is treating the path as individual files like "C",

Re: PYSPARK_DRIVER_PYTHON=ipython spark/bin/pyspark Does not create SparkContext

2015-07-27 Thread Zerony Zhao
Thank you so much. I found the issue. My fault, the stock ipython version 0.12.1 is too old, which does not support PYTHONSTARTUP. Upgrading ipython solved the issue. On Mon, Jul 27, 2015 at 12:43 PM, felixcheun...@hotmail.com wrote: Hmm, it should work with you run

Re: pyspark issue

2015-07-27 Thread Sven Krasser
It expects an iterable, and if you iterate over a string, you get the individual characters. Use a list instead: pyfiles=['/path/to/file'] On Mon, Jul 27, 2015 at 2:40 PM, Naveen Madhire vmadh...@umail.iu.edu wrote: Hi, I am running pyspark in windows and I am seeing an error while adding

Do I really need to build Spark for Hive/Thrift Server support?

2015-07-27 Thread ReeceRobinson
I'm a bit confused about the documentation in the area of Hive support. I want to use a remote Hive metastore/hdfs server and the documentation says that we need to build Spark from source due to the large number of dependencies Hive requires. Specifically the documentation says: Hive has a

Spree: a live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
Probably relevant to people on this list: on Friday I released a clone of the Spark web UI built using Meteor https://www.meteor.com/ so that everything updates in real-time, saving you from endlessly refreshing the page while jobs are running :) It can also serve as the UI for running as well as

Controlling output fileSize in SparkSQL

2015-07-27 Thread Tim Smith
Hi, I am using Spark 1.3 (CDH 5.4.4). What's the recipe for setting a minimum output file size when writing out from SparkSQL? So far, I have tried: import sqlContext.implicits._ sc.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
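Rather than a direct file-size setting, a common workaround is to reduce the number of partitions before writing so each output file is larger. A sketch against the Spark 1.3 DataFrame API, with a placeholder query, partition count and path, and Parquet only as an example output format:

    val result = sqlContext.sql("SELECT ...")   // whatever query produces the output
    result
      .repartition(16)                          // fewer partitions, fewer and larger files
      .saveAsParquetFile("hdfs:///output/path")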

RE: SparkR

2015-07-27 Thread Sun, Rui
Simply put, no. Currently SparkR is the R API for Spark DataFrames; no existing R algorithms can benefit from it unless they are rewritten to be based on that API. There is ongoing development on supporting MLlib and ML Pipelines in SparkR: https://issues.apache.org/jira/browse/SPARK-6805 From: Mohit

Weird error using absolute path to run pyspark when using ipython driver

2015-07-27 Thread Zerony Zhao
Hello everyone, Another newbie question. PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark runs fine, (in $SPARK_HOME) Python 2.7.10 (default, Jul 3 2015, 01:26:20) Type copyright, credits or license for more information. IPython 3.2.1 -- An enhanced Interactive Python. ? - Introduction and

Re: Why the length of each task varies

2015-07-27 Thread Gylfi
Hi. Have you ruled out that this may just be I/O time? Word count is a very light-weight task for the CPU, but you will need to read the initial data from whatever storage device your HDFS is running on. You have 3 machines with 22 cores each but perhaps just one or a few HDD / SSD /

Spark SQL Error

2015-07-27 Thread An Tran
Hello all, I am currently having an error with Spark SQL accessing Elasticsearch using the Elasticsearch Spark integration. Below is the series of commands I issued along with the stacktrace. I am unclear what the error could mean. I can print the schema correctly but error out if I try to display a

Json parsing library for Spark Streaming?

2015-07-27 Thread swetha
Hi, what is the proper JSON parsing library to use in Spark Streaming? Currently I am trying to use the Gson library in a Java class and calling the Java method from a Scala class, as shown below. What are the advantages of using Json4s as opposed to using the Gson library in a Java class and calling it from
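For comparison, a minimal json4s sketch in Scala; the Event case class and JSON string are made up for illustration, and in a streaming job the implicit Formats value should live inside the closure or be serializable:

    import org.json4s._
    import org.json4s.jackson.JsonMethods._

    case class Event(userId: String, action: String)

    implicit val formats = DefaultFormats

    val json = """{"userId":"u1","action":"click"}"""
    val event = parse(json).extract[Event]   // Event(u1,click)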

Re: Json parsing library for Spark Streaming?

2015-07-27 Thread Ted Yu
json4s is used by https://github.com/hammerlab/spark-json-relay See the other thread on 'Spree' FYI On Mon, Jul 27, 2015 at 6:07 PM, swetha swethakasire...@gmail.com wrote: Hi, What is the proper Json parsing library to use in Spark Streaming? Currently I am trying to use Gson library in

Re: use S3-Compatible Storage with spark

2015-07-27 Thread Schmirr Wurst
No, with s3a I have the following error: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:285) 2015-07-27 11:17 GMT+02:00 Akhil Das

Re: use S3-Compatible Storage with spark

2015-07-27 Thread Akhil Das
That error is a jar conflict; you must have multiple versions of the Hadoop jar in the classpath. First make sure you are able to access AWS S3 with s3a, then add the endpoint configuration and try to access the custom storage. Thanks Best Regards On Mon, Jul 27, 2015 at 4:02 PM,
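A sketch of the configuration being suggested; the credentials and endpoint are placeholders, and the keys are the standard Hadoop s3a properties (hadoop-aws 2.6+):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    // Once plain AWS access works, point s3a at the S3-compatible provider:
    sc.hadoopConfiguration.set("fs.s3a.endpoint", "storage.example.com")

    val lines = sc.textFile("s3a://my-bucket/path/to/data")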

Re: GenericRowWithSchema is too heavy

2015-07-27 Thread Michael Armbrust
Internally I believe that we only actually create one struct object for each row, so you are really only paying the cost of the pointer in most use cases (as shown below). scala> val df = Seq((1,2), (3,4)).toDF("a", "b") df: org.apache.spark.sql.DataFrame = [a: int, b: int] scala> df.collect() res1:

RE: Package Release Annoucement: Spark SQL on HBase Astro

2015-07-27 Thread Debasish Das
Hi Yan, Is it possible to access the hbase table through spark sql jdbc layer ? Thanks. Deb On Jul 22, 2015 9:03 PM, Yan Zhou.sc yan.zhou...@huawei.com wrote: Yes, but not all SQL-standard insert variants . *From:* Debasish Das [mailto:debasish.da...@gmail.com] *Sent:* Wednesday, July

Which directory contains third party libraries for Spark

2015-07-27 Thread Stephen Boesch
when using spark-submit: which directory contains third party libraries that will be loaded on each of the slaves? I would like to scp one or more libraries to each of the slaves instead of shipping the contents in the application uber-jar. Note: I did try adding to $SPARK_HOME/lib_managed/jars.

Hive Session gets overwritten in ClientWrapper

2015-07-27 Thread Vishak
I'm currently using Spark 1.4 in standalone mode. I've forked the Apache Hive branch from https://github.com/pwendell/hive and customised it in the following way: added a thread-local variable in the SessionManager class. And I'm setting the session variable in my

streaming issue

2015-07-27 Thread guoqing0...@yahoo.com.hk
Hi, I got an error when running Spark Streaming, as below. java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at

Spark on Mesos - Shut down failed while running spark-shell

2015-07-27 Thread Haripriya Ayyalasomayajula
Hi all, I am running Spark 1.4.1 on Mesos 0.23.0. I am able to start spark-shell on the node with the mesos-master running and it works fine, but when I try to start spark-shell on mesos-slave nodes, I encounter this error. I greatly appreciate any help. 15/07/27 22:14:44 INFO Utils:

NO Cygwin Support in bin/spark-class in Spark 1.4.0

2015-07-27 Thread Proust GZ Feng
Hi Spark users, it looks like Spark 1.4.0 cannot work with Cygwin due to the removal of Cygwin support in bin/spark-class. The changeset is https://github.com/apache/spark/commit/517975d89d40a77c7186f488547eed11f79c1e97#diff-fdf4d3e600042c63ffa17b692c4372a3 The changeset said Add a library for

Re: Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Cheng Lian
Hi Jerry, thanks for the detailed report! I haven't investigated this issue in detail, but for the input size issue, I believe this is due to a limitation of the HDFS API. It seems that Hadoop FileSystem adds the size of a whole block to the metrics even if you only touch a fraction of that

Re: [ Potential bug ] Spark terminal logs say that job has succeeded even though job has failed in Yarn cluster mode

2015-07-27 Thread Corey Nolet
Elkhan, What does the ResourceManager say about the final status of the job? Spark jobs that run as Yarn applications can fail but still successfully clean up their resources and give them back to the Yarn cluster. Because of this, there's a difference between your code throwing an exception in

GenericRowWithSchema is too heavy

2015-07-27 Thread Kevin Jung
Hi all, Spark SQL usually creates DataFrames with GenericRowWithSchema (is that right?). And 'Row' is a superclass of GenericRow and GenericRowWithSchema. The only difference is that GenericRowWithSchema carries its schema information as a StructType. But I think one DataFrame has only one schema, then

Create StructType column in data frame

2015-07-27 Thread Raghavendra Pandey
Hello, I would like to add a column of StructType to a DataFrame. What would be the best way to do it? Not sure if it is possible using withColumn. A possible way is to convert the dataframe into an RDD[Row], add the struct and then convert it back to a dataframe. But that seems like overkill. Please
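One possible approach, assuming Spark 1.4+, is the struct() function combined with withColumn, which avoids dropping down to RDD[Row]; the column names here are just examples:

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.struct

    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    // Adds a new column "pair" of type struct<id:int,name:string>.
    val withStruct = df.withColumn("pair", struct(df("id"), df("name")))
    withStruct.printSchema()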

Functions in Spark SQL

2015-07-27 Thread vinod kumar
Hi, may I know how to use the functions mentioned in http://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$ in Spark SQL? When I use something like "Select last(column) from tablename" I am getting an error like: 15/07/27 03:00:00 INFO exec.FunctionRegistry: Unable to lookup
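A hedged workaround while the SQL-text form fails: call the equivalent function from org.apache.spark.sql.functions through the DataFrame API; the table and column names follow the example above:

    import org.apache.spark.sql.functions.last

    val df = sqlContext.table("tablename")
    df.agg(last(df("column"))).show()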

spark spark-ec2 credentials using aws_security_token

2015-07-27 Thread Jan Zikeš
Hi, I would like to ask if it is currently possible to use the spark-ec2 script together with credentials that consist not only of aws_access_key_id and aws_secret_access_key, but also contain aws_security_token. When I try to run the script I am getting the following error message:

Re: suggest coding platform

2015-07-27 Thread Akhil Das
How about IntelliJ? It also has a Terminal tab. Thanks Best Regards On Fri, Jul 24, 2015 at 6:06 PM, saif.a.ell...@wellsfargo.com wrote: Hi all, I tried the Zeppelin notebook (incubating), but I am not completely happy with it. What do you people use for coding? Anything with auto-complete,

Re: RDD[Future[T]] => Future[RDD[T]]

2015-07-27 Thread Ayoub
Do you mean something like this? val values = rdd.mapPartitions { i: Iterator[Future[T]] => val future: Future[Iterator[T]] = Future.sequence(i); Await.result(future, someTimeout) } Where is the blocking happening in this case? It seems to me that all the workers will be blocked until the
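A self-contained sketch of the pattern in the snippet above, with an illustrative async call in place of a real client; each partition blocks until its own futures finish, so the task's core stays occupied while waiting:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._

    val rdd = sc.parallelize(1 to 100, 4)
    val values = rdd.mapPartitions { iter =>
      import scala.concurrent.ExecutionContext.Implicits.global
      // Stand-in for a non-blocking call (async HTTP or DB client, etc.).
      def slowLookup(x: Int): Future[Int] = Future { x * 2 }

      val futures = iter.map(slowLookup).toList       // start all calls for this partition
      val all: Future[List[Int]] = Future.sequence(futures)
      Await.result(all, 30.seconds).iterator          // block, then hand results back
    }
    values.count()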

Re: Encryption on RDDs or in-memory on Apache Spark

2015-07-27 Thread Akhil Das
Have a look at the current security support https://spark.apache.org/docs/latest/security.html, Spark does not have any encryption support for objects in memory out of the box. But if your concern is to protect the data being cached in memory, then you can easily encrypt your objects in memory

Re: Functions in Spark SQL

2015-07-27 Thread fightf...@163.com
Hi there, I tested with sqlContext.sql("select funcName(param1, param2, ...) from tableName") and it just worked fine. Would you like to paste your test code here? And which version of Spark are you using? Best, Sun. fightf...@163.com From: vinod kumar Date: 2015-07-27 15:04 To: User Subject:

Re: spark as a lookup engine for dedup

2015-07-27 Thread Shushant Arora
It's for one day of events, in the range of 1 billion, and processing is in a streaming application with a ~10-15 sec interval, so the lookup should be fast. The RDD needs to be updated with new events, and events older than 24 hours should be removed at each processing step. So is a Spark RDD not fit for this

Re: Spark - Eclipse IDE - Maven

2015-07-27 Thread Akhil Das
You can follow this doc https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup Thanks Best Regards On Fri, Jul 24, 2015 at 10:56 AM, Siva Reddy ksiv...@gmail.com wrote: Hi All, I am trying to setup the Eclipse (LUNA) with Maven so that I

Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
An RDD is immutable; it cannot be changed, you can only create a new one from data or from a transformation. It sounds inefficient to create one every 15 seconds covering the last 24 hours. I think a key-value store would be much better suited for this purpose. On Mon, Jul 27, 2015 at 11:21 AM Shushant Arora

Re: ERROR TaskResultGetter: Exception while getting task result when reading avro files that contain arrays

2015-07-27 Thread Akhil Das
It's a serialization error with the nested schema, I guess. You can look at Twitter's chill avro serializer library. Here are two discussions on the same: - https://issues.apache.org/jira/browse/SPARK-3447 -

Re: java.lang.NoSuchMethodError for list.toMap.

2015-07-27 Thread Akhil Das
What's in your build.sbt? You could be messing up the Scala version, it seems. Thanks Best Regards On Fri, Jul 24, 2015 at 2:15 AM, Dan Dong dongda...@gmail.com wrote: Hi, when I ran the following simple Spark program with spark-submit: import org.apache.spark.SparkContext._ import

hive.contrib.serde2.RegexSerDe not found

2015-07-27 Thread ZhuGe
Hi all: I am testing the performance of Hive on Spark SQL. The existing table is created with ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'input.regex' =

Re: spark as a lookup engine for dedup

2015-07-27 Thread Romi Kuntsman
What is the throughput of processing, and for how long do you need to remember duplicates? You can take all the events, put them in an RDD, group by the key, and then process each key only once. But if you have a long-running application where you want to check that you didn't see the same value
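A sketch of the group-by-key idea for a bounded batch of events; the event type and key extraction are illustrative only:

    case class Event(id: String, payload: String)

    val events = sc.parallelize(Seq(
      Event("a", "first"), Event("a", "duplicate"), Event("b", "only")))

    val deduped = events
      .map(e => (e.id, e))
      .reduceByKey((first, _) => first)   // keep a single event per key
      .values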

Re: spark dataframe gc

2015-07-27 Thread Akhil Das
The spark.shuffle.sort.bypassMergeThreshold setting might help. You could also try switching the shuffle manager from sort to hash. You can see more configuration options here: https://spark.apache.org/docs/latest/configuration.html#shuffle-behavior Thanks Best Regards On Fri, Jul 24, 2015 at 3:33

Re: RDD[Future[T]] => Future[RDD[T]]

2015-07-27 Thread Nick Pentreath
In this case, each partition will block until the futures in that partition are completed. If you are in the end collecting all the Futures to the driver, what is the reasoning behind using an RDD? You could just use a bunch of Futures directly. If you want to do some processing on the results

Re: Functions in Spark SQL

2015-07-27 Thread vinod kumar
Hi, Select last(product) from sampleTable Spark Version 1.3 -Vinod On Mon, Jul 27, 2015 at 3:48 AM, fightf...@163.com fightf...@163.com wrote: Hi, there I test with sqlContext.sql(select funcName(param1,param2,...) from tableName ) just worked fine. Would you like to paste your test

Re: ERROR SparkUI: Failed to bind SparkUI java.net.BindException: Address already in use: Service 'SparkUI' failed after 16 retries!

2015-07-27 Thread Akhil Das
For each of your jobs, you can pass spark.ui.port to bind the UI to a different port. Thanks Best Regards On Fri, Jul 24, 2015 at 7:49 PM, Joji John jj...@ebates.com wrote: Thanks Ajay. The way we wrote our spark application is that we have a generic python code, multiple instances of which can
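The same configuration key applies regardless of language binding; a Scala sketch of setting it programmatically (the port number is just an example, and Spark otherwise retries 4040, 4041, ... on conflict):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("job-with-custom-ui-port")
      .set("spark.ui.port", "4050")
    val sc = new SparkContext(conf)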

Re: use S3-Compatible Storage with spark

2015-07-27 Thread Akhil Das
So you are able to access your AWS S3 with s3a now? What is the error that you are getting when you try to access the custom storage with fs.s3a.endpoint? Thanks Best Regards On Mon, Jul 27, 2015 at 2:44 PM, Schmirr Wurst schmirrwu...@gmail.com wrote: I was able to access Amazon S3, but for

Why the length of each task varies

2015-07-27 Thread Gavin Liu
I am implementing word count on the Spark cluster (1 master, 3 slaves) in standalone mode. I have 546 GB of data, and the dfs.blocksize I set is 256 MB. Therefore, the number of tasks is 2186. My 3 slaves each use 22 cores and 72 memory to do the processing, so the computing ability of each slave

RE: unserialize error in sparkR

2015-07-27 Thread Sun, Rui
Hi, do you mean you are running the script with https://github.com/amplab-extras/SparkR-pkg and Spark 1.2? I am afraid that currently there is no development effort or support for SparkR-pkg, since it has been integrated into Spark as of Spark 1.4. Unfortunately, the RDD API and RDD-like

Re: use S3-Compatible Storage with spark

2015-07-27 Thread Schmirr Wurst
I was able to access Amazon S3, but for some reason the endpoint parameter is ignored and I'm not able to access the storage from my provider: sc.hadoopConfiguration.set("fs.s3a.endpoint", "test") sc.hadoopConfiguration.set("fs.s3a.awsAccessKeyId", )