Map Reduce -v- Parallelism

2020-10-14 Thread Hulio andres
Hi, is this guy a silly billy for comparing Apache Flink with Apache Spark? https://www.youtube.com/watch?v=sYlbD_OoHhs ("Airbus makes more of the sky with Flink" - Jesse Anderson & Hassene Ben Salem). Does Apache Spark tomcat hadoop spark support distributed as well as map re

Re: When queried through HiveContext, does Hive execute these queries using its execution engine (default is map-reduce), or does Spark just read the data and perform those queries itself?

2016-06-08 Thread lalit sharma
DataFrame. So no MapReduce: Spark intelligently uses the needed pieces from Hive and uses its own execution engine. --Regards, Lalit On Wed, Jun 8, 2016 at 9:59 PM, Vikash Pareek <vikash.par...@infoobjects.com> wrote: > Himanshu, > > Spark doesn't use the Hive execution engine (Map Red

Re: When queried through HiveContext, does Hive execute these queries using its execution engine (default is map-reduce), or does Spark just read the data and perform those queries itself?

2016-06-08 Thread Vikash Pareek
Himanshu, Spark doesn't use the Hive execution engine (MapReduce) to execute the query. Spark only reads the metadata from the Hive metastore DB and executes the query within the Spark execution engine. This metadata is used by Spark's own SQL execution engine (this includes components such as Catalyst
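
A minimal sketch of the behaviour Vikash describes, using the Spark 1.x HiveContext API this thread refers to; the table name "sales" and its columns are hypothetical placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object HiveContextSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hive-metadata-only"))
    val hiveContext = new HiveContext(sc)

    // Schema and partition info come from the Hive metastore; the data files are
    // read from HDFS by Spark tasks, and the plan is built and run by Spark's own
    // engine (Catalyst), not by Hive's MapReduce engine.
    val df = hiveContext.sql("SELECT region, SUM(amount) FROM sales GROUP BY region")
    df.explain(true) // the physical plan shows only Spark operators, no MapReduce jobs
    df.show()
  }
}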

When queried through HiveContext, does Hive execute these queries using its execution engine (default is map-reduce), or does Spark just read the data and perform those queries itself?

2016-06-08 Thread Himanshu Mehra
the results to Spark? In this case, might Hive be using map-reduce to execute the queries? Please clarify this confusion. I have looked into the code and it seems like Spark is just fetching the data from HDFS. Please convince me otherwise. Thanks Best -- View this message in context: http://apache

DIMSUM among 550k objects on AWS Elastic Map Reduce fails with OOM errors

2016-05-27 Thread nmoretto
Hello everyone, I am trying to compute the similarity between 550k objects using the DIMSUM algorithm available in Spark 1.6. The cluster runs on AWS Elastic Map Reduce and consists of 6 r3.2xlarge instances (one master and five core nodes), each with 8 vCPUs and 61 GiB of RAM. My input data
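
A hedged sketch of how DIMSUM is invoked in Spark 1.6 MLlib: RowMatrix.columnSimilarities(threshold) computes similarities between columns, so the 550k objects have to be laid out as columns of the matrix. The input path and feature layout are hypothetical; raising the threshold makes the sampling more aggressive, which is the usual first lever against OOM errors like the one reported here.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object DimsumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dimsum-similarity"))

    // Each line: space-separated feature values; each row is one feature vector,
    // each column is one of the 550k objects (transpose beforehand if needed).
    val rows = sc.textFile("hdfs:///path/to/features") // hypothetical path
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))

    val mat = new RowMatrix(rows)
    // threshold > 0 enables the sampling-based DIMSUM variant; 0.0 is the exact computation
    val similarities = mat.columnSimilarities(0.5)
    similarities.entries.take(10).foreach(println)
  }
}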

Simple Map Reduce taking lot of time

2015-07-29 Thread Varadharajan Mukundan
Hi All, I'm running Spark 1.4.1 on an 8-core machine with 16 GB RAM. I have a 500MB CSV file with 10 columns and I need to separate it into multiple CSV/Parquet files based on one of the fields in the CSV file. I've loaded the CSV file using spark-csv and applied the below transformations. It
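
A sketch of one common way to do this split in Spark 1.4 with spark-csv: read the file into a DataFrame, then write Parquet partitioned by the split column, which gives one output directory per distinct value. The paths and the column name "category" are hypothetical placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SplitCsvByColumn {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-csv"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv") // spark-csv package
      .option("header", "true")
      .option("inferSchema", "true")
      .load("hdfs:///path/to/input.csv")

    // One Parquet output directory per distinct value of the split column.
    df.write
      .partitionBy("category")
      .parquet("hdfs:///path/to/output")
  }
}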

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-16 Thread Akhil Das
You can also look into https://spark.apache.org/docs/latest/tuning.html for performance tuning. Thanks Best Regards On Mon, Jun 15, 2015 at 10:28 PM, Rex X dnsr...@gmail.com wrote: Thanks very much, Akhil. That solved my problem. Best, Rex On Mon, Jun 15, 2015 at 2:16 AM, Akhil Das

Re: How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-15 Thread Akhil Das
Something like this?
val huge_data = sc.textFile("/path/to/first.csv").map(x => (x.split("\t")(1), x.split("\t")(0)))
val gender_data = sc.textFile("/path/to/second.csv").map(x => (x.split("\t")(0), x))
val joined_data = huge_data.join(gender_data)
joined_data.take(1000)
It's Scala btw, the Python API should

How to use spark for map-reduce flow to filter N columns, top M rows of all csv files under a folder?

2015-06-12 Thread Rex X
To be concrete, say we have a folder with thousands of tab-delimited csv files with the following attribute format (each csv file is about 10GB):

id    name    address    city    ...
1     Matt    add1       LA      ...
2     Will    add2       LA      ...
3     Lucy    add3       SF      ...
...

And we have a
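
A hedged sketch of one way to express a "keep N columns, top M rows" flow over such a folder with the DataFrame API of that era; the selected columns, the ordering column, and the paths are hypothetical placeholders for the real schema and criteria.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object FilterColumnsTopRows {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("filter-top"))
    val sqlContext = new SQLContext(sc)

    // Load every file in the folder in one pass; spark-csv handles the tab delimiter.
    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", "\t")
      .load("hdfs:///path/to/folder/*.csv")

    // Keep only the columns of interest (the "N columns"),
    // then take the top M rows by some ordering column.
    val m = 1000
    val top = df.select("id", "name", "city")
      .orderBy(df("id").desc)
      .limit(m)

    top.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("hdfs:///path/to/output")
  }
}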

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
in context: http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
anyone know how I can force Spark to use only the disk when doing a simple flatMap(..).groupByKey.reduce(_ + _)? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html

map - reduce only with disk

2015-06-01 Thread octavian.ganea
Dear all, Does anyone know how I can force Spark to use only the disk when doing a simple flatMap(..).groupByKey.reduce(_ + _)? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/map-reduce-only-with-disk-tp23102.html Sent from the Apache Spark
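
A minimal sketch of the knobs relevant to this question, under the assumption that the goal is to keep data off the executor heap: persist with StorageLevel.DISK_ONLY, and prefer reduceByKey over groupByKey followed by a reduce, since it combines values map-side and avoids materialising whole groups in memory. Shuffle data still spills to local disk on its own. The input path and parsing are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object DiskOnlySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("disk-only"))

    val pairs = sc.textFile("hdfs:///path/to/input")
      .flatMap(line => line.split(" ").map(word => (word, 1)))
      .persist(StorageLevel.DISK_ONLY) // if reused, the cached copy lives on disk, not in memory

    // reduceByKey aggregates per key without building full per-key groups in memory.
    val counts = pairs.reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///path/to/output")
  }
}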

map reduce?

2015-05-21 Thread Yasemin Kaya
Hi, I have a JavaPairRDD<String, List<Integer>> and, as an example, this is what I want to get:

user_id  cat1  cat2  cat3  cat4
522      0     1     2     0
62       1     0     3     0
661      1     2     0     1

Query: the users who have a number (except 0) in the cat1 and cat3 columns. Answer: cat2 - 522,611, cat3 - 522,62 = user 522. How can I
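
A hedged restatement of the question in Scala (the original uses the Java API's JavaPairRDD): given (user_id, category counts), invert the table into category -> users with a non-zero count. The sample data mirrors the small table above; the exact query logic the poster wants is only partly visible here.

import org.apache.spark.{SparkConf, SparkContext}

object UsersPerCategory {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("users-per-category"))

    // (user_id, counts for cat1..cat4), matching the table in the question
    val data = sc.parallelize(Seq(
      ("522", List(0, 1, 2, 0)),
      ("62",  List(1, 0, 3, 0)),
      ("661", List(1, 2, 0, 1))
    ))

    // Flatten to (category, user_id) for non-zero counts, then group per category.
    val usersPerCat = data
      .flatMap { case (user, counts) =>
        counts.zipWithIndex.collect { case (c, i) if c != 0 => (s"cat${i + 1}", user) }
      }
      .groupByKey()

    usersPerCat.collect().foreach { case (cat, users) =>
      println(s"$cat -> ${users.mkString(",")}")
    }
  }
}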

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Ted Yu
Please take a look at https://issues.apache.org/jira/browse/PHOENIX-1815 On Mon, Apr 20, 2015 at 10:11 AM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for the reply. Will using Phoenix inside Spark be useful? What is the best way to bring data from HBase into Spark in terms

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread ayan guha
I think the recommended approach would be to create a DataFrame using HBase as the source. Then you can run any SQL on that DF. In 1.2 you can create a base RDD and then apply a schema in the same manner. On 21 Apr 2015 03:12, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for the reply. Will using Phoenix
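
A hedged sketch of the "base RDD, then apply a schema" route mentioned above, written against the Spark 1.3-style DataFrame API: read the HBase table with TableInputFormat, map each Result into a Row, and register a DataFrame so Spark SQL can query it. The table name, column family and qualifier are hypothetical; a connector such as Phoenix, discussed elsewhere in this thread, is the alternative.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object HBaseToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-df"))
    val sqlContext = new SQLContext(sc)

    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table") // hypothetical table

    // Base RDD of (row key, Result) straight from HBase.
    val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result])

    // Apply a schema: pull one column (cf:name) plus the row key into Rows.
    val rows = hbaseRdd.map { case (key, result) =>
      val name = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
        .map(Bytes.toString).orNull
      Row(Bytes.toString(key.get()), name)
    }
    val schema = StructType(Seq(
      StructField("rowkey", StringType), StructField("name", StringType)))

    val df = sqlContext.createDataFrame(rows, schema)
    df.registerTempTable("my_table_df")
    sqlContext.sql("SELECT rowkey, name FROM my_table_df WHERE name IS NOT NULL").show()
  }
}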

Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Jeetendra Gangele
Hi All, I am querying HBase and combining the results for use in my Spark job. I am querying HBase using the HBase client API inside my Spark job. Can anybody suggest whether Spark SQL will be fast enough and provide range queries? Regards Jeetendra

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Jeetendra Gangele
Thanks for the reply. Will using Phoenix inside Spark be useful? What is the best way to bring data from HBase into Spark in terms of application performance? Regards Jeetendra On 20 April 2015 at 20:49, Ted Yu yuzhih...@gmail.com wrote: To my knowledge, Spark SQL currently doesn't provide

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Ted Yu
To my knowledge, Spark SQL currently doesn't provide range scan capability against HBase. Cheers On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com wrote: Hi All, I am querying HBase and combining the results for use in my Spark job. I am querying HBase using the HBase

Re: randomSplit instead of a huge map reduce ?

2015-02-21 Thread Krishna Sankar
') and move on to the next part. Can anyone please explain which solution is better? Thank you very much, Shlomi. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/randomSplit-instead-of-a-huge-map-reduce-tp21744.html Sent from the Apache Spark User

randomSplit instead of a huge map reduce ?

2015-02-20 Thread shlomib
very much, Shlomi. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/randomSplit-instead-of-a-huge-map-reduce-tp21744.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: randomSplit instead of a huge map reduce ?

2015-02-20 Thread Ashish Rangole
you very much, Shlomi. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/randomSplit-instead-of-a-huge-map-reduce-tp21744.html Sent from the Apache Spark User List mailing list archive at Nabble.com

HBase Thrift API Error on map/reduce functions

2015-01-30 Thread mtheofilos
that works in that stuff tell me if that problem can be fixed? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/HBase-Thrift-API-Error-on-map-reduce-functions-tp21439.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
am missing out here? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7033.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread Christopher Nguyen
pairs? Can this be done in a distributed manner, as this data set is going to have a few million records? Can we do this in map/reduce commands? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel

Re: Can this be done in map-reduce technique (in parallel)

2014-06-05 Thread lmk
Hi Cheng, Thanks a lot. That solved my problem. Thanks again for the quick response and solution. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7047.html Sent from the Apache Spark User List mailing

Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
this in map/reduce commands? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-done-in-map-reduce-technique-in-parallel-tp6905.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Oleg Proudnikov
It is possible if you use a cartesian product to produce all possible pairs for each IP address and two stages of map-reduce (see the sketch below):
- first by pairs of points, to find the total of each pair, and
- second by IP address, to find the pair with the maximum count for each IP address.
Oleg On 4 June 2014
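
A hedged sketch of the two-stage approach Oleg outlines, with a self-join standing in for the per-IP cartesian product; the sample data and point names are hypothetical.

import org.apache.spark.{SparkConf, SparkContext}

object MaxPairPerIp {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("max-pair-per-ip"))

    // (ip, point) observations
    val visits = sc.parallelize(Seq(
      ("10.0.0.1", "A"), ("10.0.0.1", "B"), ("10.0.0.1", "A"),
      ("10.0.0.2", "C"), ("10.0.0.2", "D")
    ))

    // Stage 0: all pairs of points seen from the same IP (self-join as the cartesian product).
    val pairs = visits.join(visits)
      .filter { case (_, (p1, p2)) => p1 < p2 }          // drop self-pairs and duplicate orientations
      .map { case (ip, pair) => ((ip, pair), 1) }

    // Stage 1: total count per (ip, pair).
    val pairCounts = pairs.reduceByKey(_ + _)

    // Stage 2: pair with the maximum count per IP.
    val maxPerIp = pairCounts
      .map { case ((ip, pair), count) => (ip, (pair, count)) }
      .reduceByKey((a, b) => if (a._2 >= b._2) a else b)

    maxPerIp.collect().foreach(println)
  }
}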

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread Andrew Ash
...@gmail.com wrote: It is possible if you use a cartesian product to produce all possible pairs for each IP address and 2 stages of map-reduce: - first by pairs of points to find the total of each pair and - second by IP address to find the pair for each IP address with the maximum count. Oleg

Re: Can this be done in map-reduce technique (in parallel)

2014-06-04 Thread lmk
on for all combinations. This is where I get stuck. Please guide me on this. Thanks Again. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Can-this-be-handled-in-map-reduce-using-RDDs-tp6905p7016.html Sent from the Apache Spark User List mailing list archive