Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Jörn Franke
I agree with the others that a dedicated NoSQL datastore can make sense. You should look at the lambda architecture paradigm. Keep in mind that more memory does not necessarily mean more performance; what matters is the right data structure for your users' queries. Additionally, if your queries

Re: How do I deal with ever growing application log

2017-03-05 Thread Noorul Islam Kamal Malmiyoda
Or you could use sinks like Elasticsearch. Regards, Noorul On Mon, Mar 6, 2017 at 10:52 AM, devjyoti patra wrote: > Timothy, why are you writing application logs to HDFS? In case you want to > analyze these logs later, you can write to local storage on your slave nodes > and

Re: How do I deal with ever growing application log

2017-03-05 Thread devjyoti patra
Timothy, why are you writing application logs to HDFS? In case you want to analyze these logs later, you can write to local storage on your slave nodes and later rotate those files to a suitable location. If they are only going to be useful for debugging the application, you can always remove them
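For reference, a minimal sketch of that local-rotation approach, assuming the log4j 1.x that Spark 2.x ships with; the path and sizes are hypothetical and should be tuned per cluster:

    # log4j.properties -- roll the log locally instead of growing one file forever
    log4j.rootCategory=INFO, rolling
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.File=/var/log/spark/app.log
    log4j.appender.rolling.maxFileSize=128MB
    log4j.appender.rolling.maxBackupIndex=10
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Ship the file with the job via --files log4j.properties and point the JVMs at it with -Dlog4j.configuration=log4j.properties in spark.driver.extraJavaOptions / spark.executor.extraJavaOptions.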

FPGrowth Model is taking too long to generate frequent item sets

2017-03-05 Thread Raju Bairishetti
Hi, I am new to Spark MLlib. I am using the FPGrowth model for finding related items. The number of transactions is 63K, and the total number of items across all transactions is 200K. I am running the FPGrowth model to generate frequent item sets. It is taking a huge amount of time to generate frequent
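For context, the two knobs that most affect FPGrowth runtime are minSupport (a higher value prunes far more candidate itemsets) and numPartitions. A minimal Scala sketch, assuming transactions is an existing RDD[Array[String]] (hypothetical name):

    import org.apache.spark.mllib.fpm.FPGrowth

    val fpg = new FPGrowth()
      .setMinSupport(0.01)    // raising this is the main lever for runtime
      .setNumPartitions(100)  // spreads the conditional FP-tree work across the cluster
    val model = fpg.run(transactions)
    model.freqItemsets.take(10).foreach { is =>
      println(is.items.mkString("[", ",", "]") + " -> " + is.freq)
    }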

How do I deal with ever growing application log

2017-03-05 Thread Timothy Chan
I'm running a single-worker EMR cluster for a Structured Streaming job. How do I deal with my application log filling up HDFS? /var/log/spark/apps/application_1487823545416_0021_1.inprogress is currently 21.8 GB
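If the growing .inprogress file is the Spark event log (EMR writes event logs under /var/log/spark/apps, and a long-running streaming application keeps appending to a single log until it stops), one blunt mitigation is to turn event logging off for the streaming job, at the cost of losing its history-server UI. This assumes the file really is the event log:

    # spark-defaults.conf (or --conf spark.eventLog.enabled=false on spark-submit)
    spark.eventLog.enabled  false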

Kafka failover with multiple data centers

2017-03-05 Thread nguyen duc Tuan
Hi everyone, We are deploying a Kafka cluster for ingesting streaming data. But sometimes some of the nodes in the cluster have trouble (a node dies, the Kafka daemon is killed...). Recovering data in Kafka can be very slow; it takes several hours to recover from a disaster. I saw a slide here
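For the multi-data-center part of the question, a common approach at the time was to keep a warm replica cluster in the second DC with Kafka's MirrorMaker, so consumers can fail over to the mirror instead of waiting hours for in-place recovery. A sketch with hypothetical config file names and topic pattern:

    # continuously copy matching topics from the source DC's cluster to the target DC's
    bin/kafka-mirror-maker.sh \
      --consumer.config source-dc-consumer.properties \
      --producer.config target-dc-producer.properties \
      --whitelist "events.*"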

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread ayan guha
Any specific reason to choose Spark? It sounds like you have a write-once-read-many dataset, logically partitioned across customers, sitting in some data store. Essentially, you are looking for a fast way to access it, and most likely you will use the same partition key for
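A sketch of that access pattern in Spark, assuming events is an existing DataFrame and spark a SparkSession, with hypothetical paths and column names: write the data once, partitioned by the key every query filters on, so each read only touches the matching directories:

    import org.apache.spark.sql.functions.col

    // one-time write, laid out on disk by the common access key
    events.write.partitionBy("customer_id").parquet("/data/events")

    // later reads that filter on that key skip all other partitions
    val oneCustomer = spark.read.parquet("/data/events")
      .where(col("customer_id") === "c-42")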

[Spark Streaming] Streaming job failing consistently after 1h

2017-03-05 Thread Charles O. Bajomo
Hello all, I have a strange behaviour I can't understand. I have a streaming job using a custom Java receiver that pulls data from a JMS queue, which I process and then write to HDFS as Parquet and Avro files. For some reason my job keeps failing after 1 hour and 30 minutes. When it fails I get an

Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Subhash Sriram
Hi Allan, Where is the data stored right now? If it's in a relational database, and you are using Spark with Hadoop, I feel like it would make sense to import the data into HDFS, just because it would be faster to access. You could use Sqoop to do that. In terms of having a
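A minimal sketch of such a Sqoop import; the JDBC URL, credentials, and table are hypothetical placeholders:

    # copy one table into HDFS with 4 parallel map tasks
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl \
      --password-file /user/etl/.password \
      --table orders \
      --target-dir /data/orders \
      --num-mappers 4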

Re: [ANNOUNCE] Apache Bahir 2.1.0 Released

2017-03-05 Thread kant kodali
How about an HTTP/2 or REST connector for Spark? Is that something we can expect? Thanks! On Wed, Feb 22, 2017 at 4:07 AM, Christian Kadner wrote: > The Apache Bahir community is pleased to announce the release > of Apache Bahir 2.1.0, which provides the following extensions for >

Spark Beginner: Correct approach for use case

2017-03-05 Thread Allan Richards
Hi, I am looking to use Spark to help execute queries against a reasonably large dataset (1 billion rows). I'm a bit lost among all the different libraries and add-ons to Spark, and am looking for some direction as to what I should look at and what may be helpful. A couple of relevant points: - The

Re: pyspark cluster mode on standalone deployment

2017-03-05 Thread Ofer Eliassaf
Anyone? Please? Is this getting any priority? On Tue, Sep 27, 2016 at 3:38 PM, Ofer Eliassaf wrote: > Is there any plan to support Python Spark running in "cluster mode" on a > standalone deployment? > > There is this famous survey mentioning that more than 50% of the

Re: spark jobserver

2017-03-05 Thread Noorul Islam K M
A better forum would be https://groups.google.com/forum/#!forum/spark-jobserver or https://gitter.im/spark-jobserver/spark-jobserver Regards, Noorul Madabhattula Rajesh Kumar writes: > Hi, > > I am getting the below exception when I start the job-server > >

spark jobserver

2017-03-05 Thread Madabhattula Rajesh Kumar
Hi, I am getting the below exception when I start the job-server: ./server_start.sh: line 41: kill: (11482) - No such process Please let me know how to resolve this error. Regards, Rajesh
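That message usually comes from the start script's liveness check: it found a pidfile left over from a previous run and probed a process that no longer exists. Removing the stale pidfile silences it; the pidfile name below is an assumption, so check the PIDFILE variable in server_start.sh:

    rm -f spark-jobserver.pid   # remove the stale pidfile from the last run
    ./server_start.sh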

unsubscribe

2017-03-05 Thread Howard Chen

Re: Sharing my DataFrame (DataSet) cheat sheet.

2017-03-05 Thread Yan Facai
Thanks, very useful! On Sun, Mar 5, 2017 at 4:55 AM, Yuhao Yang wrote: > > Sharing some snippets I accumulated while developing with Apache Spark > DataFrame (DataSet). Hope it can help you in some way. > > https://github.com/hhbyyh/DataFrameCheatSheet. > > [image: inline image 1] >

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread ayan guha
Just as a best practice, DataFrames and Datasets are the preferred way, so try not to resort to RDDs unless you absolutely have to... On Sun, 5 Mar 2017 at 7:10 pm, khwunchai jaengsawang wrote: > Hi Old-Scool, > > > For the first question, you can specify the number of partitions in
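A small Scala illustration of why, assuming data is an existing DataFrame and the column names are hypothetical: the DataFrame version is a plan the Catalyst optimizer can see, while the RDD version is an opaque lambda the engine cannot optimize:

    // counting rows per key, two ways
    val viaRdd = data.rdd.map(r => (r.getString(0), 1L)).reduceByKey(_ + _) // opaque to Catalyst
    val viaDf  = data.groupBy("key").count()                                // optimizable plan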

Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread khwunchai jaengsawang
Hi Old-Scool, For the first question, you can specify the number of partitions for any DataFrame by using repartition(numPartitions: Int, partitionExprs: Column*). Example: val partitioned = data.repartition(numPartitions = 10).cache() For your second question, you can transform your RDD
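Both repartition overloads side by side, assuming data is an existing DataFrame and the column name is hypothetical:

    import org.apache.spark.sql.functions.col

    val byCount = data.repartition(10)                     // fixed partition count only
    val byKey   = data.repartition(10, col("customer_id")) // count plus hash-partitioning on a column
    byKey.cache()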