Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Jörn Franke
I agree with the others that a dedicated NoSQL datastore can make sense. You 
should look at the lambda architecture paradigm. Keep in mind that more memory 
does not necessarily mean more performance; what matters is having the right 
data structure for your users' queries. Additionally, if your queries run over 
the whole dataset and you want answer times of around 2 seconds, you should look 
at databases that do aggregations on samples of the data (cf. 
https://jornfranke.wordpress.com/2015/06/28/big-data-what-is-next-oltp-olap-predictive-analytics-sampling-and-probabilistic-databases).
E.g. Hive has had TABLESAMPLE functionality for a long time.
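
For illustration, a rough sketch of that sampling idea with the DataFrame API
(the table and column names are made up, and the 1% fraction is only an example):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("sampled-aggregation").getOrCreate()

// Aggregate over a 1% sample and scale the result back up, instead of
// scanning the full billion-row table for every query.
val approxTotals = spark.table("events")                        // hypothetical table
  .sample(withReplacement = false, fraction = 0.01, seed = 42)
  .groupBy("customer_id")                                       // hypothetical column
  .agg((sum("amount") * 100).as("approx_amount"))               // rescale the 1% sample

approxTotals.show()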

> On 5 Mar 2017, at 21:49, Allan Richards  wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large 
> dataset (1 billion rows). I'm a bit lost with all the different libraries / 
> add ons to Spark, and am looking for some direction as to what I should look 
> at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's 
> more complicated than a single SQL query) that I want to run against it, but 
> the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will generally 
> consist of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So 
> ideally I want to keep all the records in memory, but distributed over the 
> different nodes in the cluster. Does this mean sharing a SparkContext between 
> queries, or is this where HDFS comes in, or is there something else that 
> would be better suited?
> 
> Or is there another overall approach I should look into for executing queries 
> in "real time" against a dataset this size?
> 
> Thanks,
> Allan.


Re: How do I deal with ever growing application log

2017-03-05 Thread Noorul Islam Kamal Malmiyoda
Or you could ship the logs to a sink like Elasticsearch.

Regards,
Noorul

On Mon, Mar 6, 2017 at 10:52 AM, devjyoti patra  wrote:
> Timothy, why are you writing application logs to HDFS? In case you want to
> analyze these logs later, you can write to local storage on your slave nodes
> and later rotate those files to a suitable location. If they are only going
> to be useful for debugging the application, you can always remove them
> periodically.
> Thanks,
> Dev
>
> On Mar 6, 2017 9:48 AM, "Timothy Chan"  wrote:
>>
>> I'm running a single worker EMR cluster for a Structured Streaming job.
>> How do I deal with my application log filling up HDFS?
>>
>> /var/log/spark/apps/application_1487823545416_0021_1.inprogress
>>
>> is currently 21.8 GB
>>




Re: How do I deal with ever growing application log

2017-03-05 Thread devjyoti patra
Timothy, why are you writing application logs to HDFS? In case you want to
analyze these logs later, you can write to local storage on your slave
nodes and later rotate those files to a suitable location. If they are only
going to be useful for debugging the application, you can always remove them
periodically.
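
If the growing file is actually the Spark event log (the .inprogress suffix
suggests it is), a blunt option is to turn event logging off for the streaming
job; rolling executor logs are a separate knob worth setting as well. A minimal
sketch, with placeholder sizes:

import org.apache.spark.sql.SparkSession

// Sketch only: the retention numbers below are arbitrary placeholders.
val spark = SparkSession.builder()
  .appName("structured-streaming-job")
  .config("spark.eventLog.enabled", "false")                    // stop the ever-growing event log on HDFS
  .config("spark.executor.logs.rolling.strategy", "size")       // roll executor stdout/stderr by size
  .config("spark.executor.logs.rolling.maxSize", "134217728")   // 128 MB per rolled file
  .config("spark.executor.logs.rolling.maxRetainedFiles", "10") // keep only the newest 10 files
  .getOrCreate()

Note that turning the event log off also removes the job from the history
server, so it is a trade-off rather than a fix.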
Thanks,
Dev

On Mar 6, 2017 9:48 AM, "Timothy Chan"  wrote:

> I'm running a single worker EMR cluster for a Structured Streaming job.
> How do I deal with my application log filling up HDFS?
>
> /var/log/spark/apps/application_1487823545416_0021_1.inprogress
>
> is currently 21.8 GB
>


FPGrowth Model is taking too long to generate frequent item sets

2017-03-05 Thread Raju Bairishetti
Hi,
  I am new to Spark MLlib. I am using the FPGrowth model for finding related
items.

The number of transactions is 63K and the total number of items across all
transactions is 200K.

I am running the FPGrowth model to generate frequent itemsets, and it is
taking a huge amount of time. I am setting the min-support value such that
each item appears in at least roughly (number of items)/(number of
transactions) transactions.

It also takes a lot of time if I set min-support so low that an item only
needs to appear once in the database.

If I give a higher value for min-support, the output is much smaller.

Could anyone please guide me on how to reduce the execution time for
generating frequent itemsets?
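
For reference, a minimal sketch of the standard MLlib FPGrowth invocation (the
minSupport value and partition count here are arbitrary placeholders, not tuned
recommendations):

import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fpgrowth-sketch").setMaster("local[*]"))

// Toy transactions standing in for the real 63K-transaction dataset.
val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b"),
  Array("b", "c", "d")
))

val model = new FPGrowth()
  .setMinSupport(0.01)   // fraction of transactions an itemset must appear in
  .setNumPartitions(32)  // spread the FP-tree construction across more tasks
  .run(transactions)

model.freqItemsets.take(20).foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")}: ${itemset.freq}")
}

Raising minSupport and increasing numPartitions are the two main levers for
runtime; a very low minSupport makes the number of candidate itemsets explode,
which is usually what makes FPGrowth slow.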

--
Thanks,
Raju Bairishetti,
www.lazada.com


How do I deal with ever growing application log

2017-03-05 Thread Timothy Chan
I'm running a single worker EMR cluster for a Structured Streaming job. How
do I deal with my application log filling up HDFS?

/var/log/spark/apps/application_1487823545416_0021_1.inprogress

is currently 21.8 GB



Kafka failover with multiple data centers

2017-03-05 Thread nguyen duc Tuan
Hi everyone,
We are deploying a Kafka cluster for ingesting streaming data. But sometimes
some of the nodes in the cluster have trouble (a node dies, the Kafka daemon
is killed, ...), and recovering data in Kafka can be very slow: it can take
several hours to recover from a disaster. I saw a slide deck suggesting the
use of multiple data centers (
https://www.slideshare.net/HadoopSummit/building-largescale-stream-infrastructures-across-multiple-data-centers-with-apache-kafka).
But I wonder: how can we detect the problem and switch between data centers
in Spark Streaming? And since Kafka 0.10.1 supports a timestamp index, how
can we seek to the right offsets?
Is there any open-source library out there that supports handling this
problem on the fly?
Thanks.
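
For the offset-seeking part, Kafka 0.10.1's offsetsForTimes API can translate a
timestamp into offsets. A rough sketch (the broker address, topic and group id
are made up):

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "backup-dc-broker:9092")  // hypothetical broker in the backup DC
props.put("group.id", "failover-offset-probe")           // hypothetical group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("events", 0)                       // hypothetical topic/partition
val resumeFrom = System.currentTimeMillis() - 10 * 60 * 1000L  // e.g. replay the last 10 minutes

// offsetsForTimes returns, per partition, the earliest offset whose timestamp
// is greater than or equal to the requested timestamp.
val result = consumer.offsetsForTimes(
  Collections.singletonMap(tp, java.lang.Long.valueOf(resumeFrom)))
val target = result.get(tp)
if (target != null) {
  consumer.assign(Collections.singletonList(tp))
  consumer.seek(tp, target.offset())
}
consumer.close()

The resulting offsets could then be handed to the direct Kafka stream (e.g. via
ConsumerStrategies.Assign in spark-streaming-kafka-0-10), but detecting the
outage and deciding when to switch data centers still has to come from your own
monitoring.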


Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread ayan guha
Any specific reason to choose Spark? It sounds like you have a
write-once-read-many-times dataset, logically partitioned across customers,
sitting in some data store. Essentially you are looking for a fast way to
access it, and most likely you will use the same partition key for querying
the data. This is more of a database/NoSQL kind of use case than a Spark one
(Spark being more of a distributed processing engine, I reckon).

On Mon, Mar 6, 2017 at 11:56 AM, Subhash Sriram 
wrote:

> Hi Allan,
>
> Where is the data stored right now? If it's in a relational database, and
> you are using Spark with Hadoop, I feel like it would make sense to import
> the data into HDFS, just because it would be faster to access the data.
> You could use Sqoop to do that.
>
> In terms of having a long running Spark context, you could look into the
> Spark job server:
>
> https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md
>
> It would allow you to cache all the data in memory and then accept queries
> via REST API calls. You would have to refresh your cache as the data
> changes of course, but it sounds like that is not very often.
>
> In terms of running the queries themselves, I would think you could use
> Spark SQL and the DataFrame/DataSet API, which is built into Spark. You
> will have to think about the best way to partition your data, depending on
> the queries themselves.
>
> Here is a link to the Spark SQL docs:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html
>
> I hope that helps, and I'm sure other folks will have some helpful advice
> as well.
>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Mar 5, 2017, at 3:49 PM, Allan Richards 
> wrote:
>
> Hi,
>
> I am looking to use Spark to help execute queries against a reasonably
> large dataset (1 billion rows). I'm a bit lost with all the different
> libraries / add ons to Spark, and am looking for some direction as to what
> I should look at / what may be helpful.
>
> A couple of relevant points:
>  - The dataset doesn't change over time.
>  - There are a small number of applications (or queries I guess, but it's
> more complicated than a single SQL query) that I want to run against it,
> but the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will
> generally consist of 1-5000 rows.
>
> I want each query to run as fast as possible (less than a second or two).
> So ideally I want to keep all the records in memory, but distributed over
> the different nodes in the cluster. Does this mean sharing a SparkContext
> between queries, or is this where HDFS comes in, or is there something else
> that would be better suited?
>
> Or is there another overall approach I should look into for executing
> queries in "real time" against a dataset this size?
>
> Thanks,
> Allan.
>
>


-- 
Best Regards,
Ayan Guha


[Spark Streamiing] Streaming job failing consistently after 1h

2017-03-05 Thread Charles O. Bajomo
Hello all, 

I have some strange behaviour I can't understand. I have a streaming job using a 
custom Java receiver that pulls data from a JMS queue, which I process and then 
write to HDFS as Parquet and Avro files. For some reason my job keeps failing 
after 1 hour and 30 minutes. When it fails I get an error saying the "container is 
running beyond physical memory limits. Current usage: 4.5GB of 4.5GB physical 
memory used; 6.4GB of 9.4GB virtual memory used." To be honest I don't 
understand the error. What are the memory limits shown in the error referring 
to? I allocated 10 executors with 6 cores each and 4G of executor and driver 
memory. I set the overhead memory to 2.8G, so the values don't add up. 

Does anyone have any idea what the error is referring to? I have increased the 
memory and it didn't help; it appears it just bought me more time. 

Thanks. 
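
For what it's worth, YARN enforces that limit as the requested container size,
which for an executor is spark.executor.memory plus the memory overhead; by
default the overhead is roughly max(384 MB, 0.10 * executor memory). A rough
check against the numbers above, assuming that default is in effect:

  4 GB (executor memory) + max(384 MB, 0.10 * 4 GB ~= 410 MB) ~= 4.4 GB,
  which matches the 4.5 GB limit in the error.

With a 2.8 GB overhead actually applied, the limit would be about 4 GB + 2.8 GB
= 6.8 GB, so the 4.5 GB figure suggests the overhead setting is not reaching the
failing container (or that the container being killed is the driver/AM, which
has its own overhead setting).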


Re: Spark Beginner: Correct approach for use case

2017-03-05 Thread Subhash Sriram
Hi Allan,

Where is the data stored right now? If it's in a relational database, and you 
are using Spark with Hadoop, I feel like it would make sense to import the data 
into HDFS, just because it would be faster to access the data. You could use 
Sqoop to do that.

In terms of having a long running Spark context, you could look into the Spark 
job server:

https://github.com/spark-jobserver/spark-jobserver/blob/master/README.md

It would allow you to cache all the data in memory and then accept queries via 
REST API calls. You would have to refresh your cache as the data changes of 
course, but it sounds like that is not very often.

In terms of running the queries themselves, I would think you could use Spark 
SQL and the DataFrame/DataSet API, which is built into Spark. You will have to 
think about the best way to partition your data, depending on the queries 
themselves.

Here is a link to the Spark SQL docs:

http://spark.apache.org/docs/latest/sql-programming-guide.html
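
As a rough illustration of the cache-then-filter pattern (the path and column
names below are made up):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("customer-queries").getOrCreate()

// Load once, partition by the customer grouping, and keep it in memory.
val records = spark.read.parquet("hdfs:///data/records")   // hypothetical path
  .repartition(200, col("customer_id"))                    // hypothetical column
  .cache()
records.count()  // force materialization of the cache

// Each incoming "query" is then a cheap filter over the cached data.
def customerSlice(customerId: Long) =
  records.filter(col("customer_id") === customerId)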

I hope that helps, and I'm sure other folks will have some helpful advice as 
well.

Thanks,
Subhash 

Sent from my iPhone

> On Mar 5, 2017, at 3:49 PM, Allan Richards  wrote:
> 
> Hi,
> 
> I am looking to use Spark to help execute queries against a reasonably large 
> dataset (1 billion rows). I'm a bit lost with all the different libraries / 
> add ons to Spark, and am looking for some direction as to what I should look 
> at / what may be helpful.
> 
> A couple of relevant points:
>  - The dataset doesn't change over time. 
>  - There are a small number of applications (or queries I guess, but it's 
> more complicated than a single SQL query) that I want to run against it, but 
> the parameters to those queries will change all the time.
>  - There is a logical grouping of the data per customer, which will generally 
> consist of 1-5000 rows.
> 
> I want each query to run as fast as possible (less than a second or two). So 
> ideally I want to keep all the records in memory, but distributed over the 
> different nodes in the cluster. Does this mean sharing a SparkContext between 
> queries, or is this where HDFS comes in, or is there something else that 
> would be better suited?
> 
> Or is there another overall approach I should look into for executing queries 
> in "real time" against a dataset this size?
> 
> Thanks,
> Allan.


Re: [ANNOUNCE] Apache Bahir 2.1.0 Released

2017-03-05 Thread kant kodali
How about an HTTP/2 or REST connector for Spark? Is that something we can expect?

Thanks!

On Wed, Feb 22, 2017 at 4:07 AM, Christian Kadner 
wrote:

> The Apache Bahir community is pleased to announce the release
> of Apache Bahir 2.1.0 which provides the following extensions for
> Apache Spark 2.1.0:
>
>- Akka Streaming
>- MQTT Streaming
>- MQTT Structured Streaming
>- Twitter Streaming
>- ZeroMQ Streaming
>
> For more information about Apache Bahir and to download the
> latest release go to:
>
> http://bahir.apache.org
>
> The Apache Bahir streaming connectors are also available at:
>
> https://spark-packages.org/?q=bahir
>
> ---
> Best regards,
> Christian Kadner
>
>


Spark Beginner: Correct approach for use case

2017-03-05 Thread Allan Richards
Hi,

I am looking to use Spark to help execute queries against a reasonably
large dataset (1 billion rows). I'm a bit lost with all the different
libraries / add ons to Spark, and am looking for some direction as to what
I should look at / what may be helpful.

A couple of relevant points:
 - The dataset doesn't change over time.
 - There are a small number of applications (or queries I guess, but it's
more complicated than a single SQL query) that I want to run against it,
but the parameters to those queries will change all the time.
 - There is a logical grouping of the data per customer, which will
generally consist of 1-5000 rows.

I want each query to run as fast as possible (less than a second or two).
So ideally I want to keep all the records in memory, but distributed over
the different nodes in the cluster. Does this mean sharing a SparkContext
between queries, or is this where HDFS comes in, or is there something else
that would be better suited?

Or is there another overall approach I should look into for executing
queries in "real time" against a dataset this size?

Thanks,
Allan.


Re: pyspark cluster mode on standalone deployment

2017-03-05 Thread Ofer Eliassaf
Anyone? Please? Is this getting any priority?

On Tue, Sep 27, 2016 at 3:38 PM, Ofer Eliassaf 
wrote:

> Is there any plan to support Python Spark running in "cluster mode" on a
> standalone deployment?
>
> There is this famous survey mentioning that more than 50% of users are
> using the standalone configuration.
> Making pyspark work in cluster mode on standalone would help a lot with
> high availability for Python Spark.
>
> Currently only the YARN deployment supports it. Bringing in the huge YARN
> installation just for this feature is not fun at all.
>
> Does someone have a time estimate for this?
>
>
>
> --
> Regards,
> Ofer Eliassaf
>



-- 
Regards,
Ofer Eliassaf


Re: spark jobserver

2017-03-05 Thread Noorul Islam K M

A better forum would be

https://groups.google.com/forum/#!forum/spark-jobserver

or

https://gitter.im/spark-jobserver/spark-jobserver

Regards,
Noorul


Madabhattula Rajesh Kumar  writes:

> Hi,
>
> I am getting the exception below when I start the job-server
>
> ./server_start.sh: line 41: kill: (11482) - No such process
>
> Please let me know how to resolve this error
>
> Regards,
> Rajesh




spark jobserver

2017-03-05 Thread Madabhattula Rajesh Kumar
Hi,

I am getting the exception below when I start the job-server

./server_start.sh: line 41: kill: (11482) - No such process

Please let me know how to resolve this error

Regards,
Rajesh


unsubscribe

2017-03-05 Thread Howard Chen



Re: Sharing my DataFrame (DataSet) cheat sheet.

2017-03-05 Thread Yan Facai
Thanks,
very useful!

On Sun, Mar 5, 2017 at 4:55 AM, Yuhao Yang  wrote:

>
> Sharing some snippets I accumulated during developing with Apache Spark
> DataFrame (DataSet). Hope it can help you in some way.
>
> https://github.com/hhbyyh/DataFrameCheatSheet.
>
>
> Regards,
> Yuhao Yang
>


Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread ayan guha
Just as a best practice, DataFrames and Datasets are the preferred way, so try
not to resort to RDDs unless you absolutely have to...
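
For example, both parts of the question can stay in the DataFrame API. A
sketch, assuming a SparkSession named spark and the data DataFrame from the
post below:

import spark.implicits._
import org.apache.spark.sql.functions.sum

// Number of partitions used by groupBy/agg shuffles.
spark.conf.set("spark.sql.shuffle.partitions", "10")

val result = data
  .repartition(10)                       // explicit partition count on the input side
  .groupBy("word-ID")
  .agg(sum($"count").as("count"))
result.show()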

On Sun, 5 Mar 2017 at 7:10 pm, khwunchai jaengsawang 
wrote:

> Hi Old-Scool,
>
>
> For the first question, you can specify the number of partitions for any
> DataFrame by using
> repartition(numPartitions: Int, partitionExprs: Column*).
> *Example*:
> val partitioned = data.repartition(numPartitions = 10).cache()
>
> For your second question, you can convert the DataFrame to an RDD of pairs
> and use reduceByKey()
> *Example:*
> val pairs = data.rdd.map(row => (row.getString(1), row.getString(2).toInt))
>   .reduceByKey(_ + _)
>
>
> Best,
>
>   Khwunchai Jaengsawang
>   *Email*: khwuncha...@ku.th
>   LinkedIn  | Github
> 
>
>
> On Mar 4, 2560 BE, at 8:59 PM, Old-School 
> wrote:
>
> Hi,
>
> I want to perform some simple transformations and check the execution time,
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> DataFrame, I guess that I should probably use RDDs.
>
> I've got a dataset with 3 columns as shown below:
>
> val data = file.map(line => line.split(" "))
>  .filter(lines => lines.length == 3) // ignore first line
>  .map(row => (row(0), row(1), row(2)))
>  .toDF("ID", "word-ID", "count")
> results in:
>
> +-----+---------+-------+
> | ID  | word-ID | count |
> +-----+---------+-------+
> |  15 |      87 |   151 |
> |  20 |      19 |   398 |
> |  15 |      19 |    21 |
> | 180 |      90 |   190 |
> +-----+---------+-------+
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10?
>
> Furthermore, I would also like to ask about the equivalent expression
> (using
> RDD API) for the following simple transformation:
>
> data.select("word-ID",
> "count").groupBy("word-ID").agg(sum($"count").as("count")).show()
>
>
>
> Thanks in advance
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>
> --
Best Regards,
Ayan Guha


Re: [RDDs and Dataframes] Equivalent expressions for RDD API

2017-03-05 Thread khwunchai jaengsawang
Hi Old-Scool,


For the first question, you can specify the number of partitions for any
DataFrame by using repartition(numPartitions: Int, partitionExprs: Column*).
Example:
val partitioned = data.repartition(numPartitions = 10).cache()

For your second question, you can convert the DataFrame to an RDD of pairs and
use reduceByKey()
Example:
val pairs = data.rdd.map(row => (row.getString(1), row.getString(2).toInt))
  .reduceByKey(_ + _)


Best,

  Khwunchai Jaengsawang
  Email: khwuncha...@ku.th
  LinkedIn  | Github 



> On Mar 4, 2560 BE, at 8:59 PM, Old-School  
> wrote:
> 
> Hi,
> 
> I want to perform some simple transformations and check the execution time,
> under various configurations (e.g. number of cores being used, number of
> partitions etc). Since it is not possible to set the partitions of a
> DataFrame, I guess that I should probably use RDDs.
> 
> I've got a dataset with 3 columns as shown below:
> 
> val data = file.map(line => line.split(" "))
>  .filter(lines => lines.length == 3) // ignore first line
>  .map(row => (row(0), row(1), row(2)))
>  .toDF("ID", "word-ID", "count")
> results in:
> 
> +-----+---------+-------+
> | ID  | word-ID | count |
> +-----+---------+-------+
> |  15 |      87 |   151 |
> |  20 |      19 |   398 |
> |  15 |      19 |    21 |
> | 180 |      90 |   190 |
> +-----+---------+-------+
> So how can I turn the above into an RDD in order to use e.g.
> sc.parallelize(data, 10) and set the number of partitions to say 10? 
> 
> Furthermore, I would also like to ask about the equivalent expression (using
> RDD API) for the following simple transformation:
> 
> data.select("word-ID",
> "count").groupBy("word-ID").agg(sum($"count").as("count")).show()
> 
> 
> 
> Thanks in advance
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-and-Dataframes-Equivalent-expressions-for-RDD-API-tp28455.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 