Re: Spark for offline log processing/querying

2016-05-22 Thread Sonal Goyal
Hi Mat, I think you could also use Spark SQL to query the logs. Hope the following link helps: https://databricks.com/blog/2014/09/23/databricks-reference-applications.html On May 23, 2016 10:59 AM, "Mat Schaffer" wrote: > I'm curious about trying to use Spark as a cheap/slow
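A minimal sketch of the Spark SQL approach suggested here, assuming Spark 1.6; the S3 path and the log-line layout are placeholders, not something from the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LogQuery {
  // Hypothetical log line: "2016-05-22T10:15:00 INFO my-service handled request in 42ms"
  case class LogLine(ts: String, level: String, service: String, message: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-query"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Parse raw text lines into a DataFrame and expose it to Spark SQL;
    // lines that do not split into four fields are simply dropped.
    val logs = sc.textFile("s3n://logs/region/grouping/*/*/*.logs")
      .map(_.split(" ", 4))
      .collect { case Array(ts, level, service, msg) => LogLine(ts, level, service, msg) }
      .toDF()
    logs.registerTempTable("logs")

    sqlContext.sql("SELECT service, count(*) FROM logs WHERE level = 'ERROR' GROUP BY service").show()
  }
}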

Re:How spark depends on Guava

2016-05-22 Thread Todd
Can someone please take a look at my question? I am running spark-shell in local mode and yarn-client mode. The Spark code uses the Guava library, so Spark should have Guava in place at run time. Thanks. At 2016-05-23 11:48:58, "Todd" wrote: Hi, In the spark code, guava maven dependency

Spark for offline log processing/querying

2016-05-22 Thread Mat Schaffer
I'm curious about trying to use Spark as a cheap/slow ELK (ElasticSearch, Logstash, Kibana) system. Thinking something like: - instances rotate local logs - copy rotated logs to S3 (s3://logs/region/grouping/instance/service/*.logs) - Spark to convert from raw text logs to Parquet - maybe Presto to
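For reference, a rough sketch of the "raw text logs to Parquet" step on Spark 1.6; the S3 layout below mirrors the one in the message, but the log format and the output bucket are assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object LogsToParquet {
  case class Entry(instance: String, service: String, line: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("logs-to-parquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Rotated logs for one region/grouping; each file path carries instance and service.
    val raw = sc.wholeTextFiles("s3n://logs/us-east-1/web/*/*/*.logs")

    val entries = raw.flatMap { case (path, content) =>
      val parts = path.split("/")                        // .../instance/service/file.logs
      val (instance, service) = (parts(parts.length - 3), parts(parts.length - 2))
      content.split("\n").map(l => Entry(instance, service, l))
    }.toDF()

    // Columnar output that Presto or Spark SQL can query cheaply later on.
    entries.write.partitionBy("service").parquet("s3n://logs-parquet/us-east-1/web/")
  }
}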

Re: Handling Empty RDD

2016-05-22 Thread Yogesh Vyas
Hi, I finally got it working. I was using the updateStateByKey() function to maintain the previous value of the state, and I found that the event list was empty. Handling the empty event list with event.isEmpty() sorted out the problem. On Sun, May 22, 2016 at 7:59 PM, Ted Yu
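A small sketch of the pattern described here, assuming a DStream of (key, count) pairs; the state is a running Long, and everything apart from the empty-list guard is illustrative:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulCount {
  // Guard against an empty event list: keep the previous state instead of recomputing.
  def updateFunc(events: Seq[Long], state: Option[Long]): Option[Long] =
    if (events.isEmpty) state
    else Some(state.getOrElse(0L) + events.sum)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("stateful-count"), Seconds(10))
    ssc.checkpoint("/tmp/checkpoints")   // updateStateByKey requires checkpointing

    val lines = ssc.textFileStream("/data/incoming")
    val counts = lines.map(l => (l.split(",")(0), 1L)).updateStateByKey(updateFunc)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}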

Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Timur Shenkao
Hi, Thanks a lot for such an interesting comparison. But important questions remain to be addressed: 1) How to make 2 versions of Spark live together on the same cluster (libraries clash, paths, etc.)? Most Spark users perform ETL and ML operations on Spark as well. So we may have 3 Spark

How spark depends on Guava

2016-05-22 Thread Todd
Hi, In the Spark code, the Guava Maven dependency scope is provided. My question is, how does Spark depend on Guava at runtime? I looked into spark-assembly-1.6.1-hadoop2.6.1.jar and didn't find class entries like com.google.common.base.Preconditions etc...

Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-22 Thread Mich Talebzadeh
Hi, I have done a number of extensive tests using Spark-shell with Hive DB and ORC tables. Now, one issue that we typically face is, and I quote: Spark is fast as it uses memory and DAG. Great, but when we save data it is not fast enough. OK, but there is a solution now. If you use Spark with

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Whatever you do, the lion's share of time is going to be taken by the insert into the Hive table. OK, check this. It is CSV files inserted into a Hive ORC table. This version uses Hive on the Spark engine and it is written in Hive, executed via beeline --1 Move .CSV data into HDFS: --2 Create an external table.

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am currently doing 1, using the following, and it takes a lot of time. What's the advantage of doing 2 and how do I do it? sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING) stored as ORC LOCATION
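A hedged sketch of what option 1 could look like end to end with the table quoted above: one external ORC table populated through a dynamic-partition insert from a staging table. The staging table name and the session settings are assumptions, not something confirmed in the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object OrcPartitionedInsert {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("orc-insert"))
    val sqlContext = new HiveContext(sc)

    sqlContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING)
        |PARTITIONED BY (datePartition STRING, idPartition STRING)
        |STORED AS ORC LOCATION '/user/users'""".stripMargin)

    // Allow one INSERT to populate many partitions at once.
    sqlContext.sql("SET hive.exec.dynamic.partition = true")
    sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // staging_records is an assumed source table with the same columns plus the partition keys.
    sqlContext.sql(
      """INSERT OVERWRITE TABLE records PARTITION (datePartition, idPartition)
        |SELECT id, record, datePartition, idPartition FROM staging_records""".stripMargin)
  }
}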

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Two alternatives for this ETL or ELT: 1. there is only one external ORC table and you do insert overwrite into that external table through Spark SQL, or 2. 14k files loaded into a staging area/read directory and then insert overwrite into an ORC table and th

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
Around 14000 partitions need to be loaded every hour. Yes, I tested this and it's taking a lot of time to load. A partition would look something like the following, which is further partitioned by userId with all the userRecords for that date inside it. 5 2016-05-20 16:03

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
By partition do you mean 14000 files loaded in each batch session (say daily)? Have you actually tested this?

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
The data is not very big, say 1 MB-10 MB at the max per partition. What is the best way to insert these 14k partitions with decent performance? On Sun, May 22, 2016 at 12:18 PM, Mich Talebzadeh wrote: > the acid question is how many rows are you going to insert in a

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
The acid question is: how many rows are you going to insert in a batch session? Btw, if this is purely an SQL operation then you can do all of that in Hive running on the Spark engine. It will be very fast as well.

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Jörn Franke
14000 partitions seem to be way too many to be performant (except for large data sets). How much data does one partition contain? > On 22 May 2016, at 09:34, SRK wrote: > > Hi, > > In my Spark SQL query to insert data, I have around 14,000 partitions of > data which

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Sabarish Sasidharan
Can't you just reduce the amount of data you insert by applying a filter so that only a small set of idPartitions is selected? You could have multiple such inserts to cover all idPartitions. Does that help? Regards Sab On 22 May 2016 1:11 pm, "swetha kasireddy" wrote:
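A sketch of that idea, chunking the distinct idPartition values into groups of 100 and issuing one filtered insert per group; it is written spark-shell style (sqlContext assumed to be a HiveContext), and the staging table and column names are assumptions carried over from earlier in the thread. Because INSERT OVERWRITE with dynamic partitions only replaces the partitions it actually writes, the batches do not clobber one another as long as every idPartition lands in exactly one batch, which also speaks to the overwrite concern raised elsewhere in the thread:

// spark-shell style: sqlContext is assumed to be the HiveContext the shell provides.
sqlContext.sql("SET hive.exec.dynamic.partition = true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// All partition keys present in the (assumed) staging table.
val ids: Array[String] = sqlContext
  .sql("SELECT DISTINCT idPartition FROM staging_records")
  .collect()
  .map(_.getString(0))

// Roughly 100 partitions per INSERT instead of all 14,000 at once.
ids.grouped(100).foreach { batch =>
  val inList = batch.map(id => s"'$id'").mkString(",")
  sqlContext.sql(
    s"""INSERT OVERWRITE TABLE records PARTITION (datePartition, idPartition)
       |SELECT id, record, datePartition, idPartition
       |FROM staging_records
       |WHERE idPartition IN ($inList)""".stripMargin)
}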

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
So, if I put 1000 records at a time and the next 1000 records have some records with the same partition as the previous records, then the data will be overwritten. How can I prevent overwriting valid data in this case? Could you post the example that you are talking about? What I am doing is

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
OK, is the staging table used as staging only? You can create a staging *directory* where you put your data (you can put 100s of files there) and do an insert/select that will take data from 100 files into your main ORC table. I have an example of 100s of CSV files insert/select from a
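A hedged sketch of that staging-directory pattern, done through Spark SQL rather than beeline; the paths, delimiter, and column list are placeholders (spark-shell style, with sqlContext as a HiveContext and the records ORC table from earlier in the thread):

// External staging table pointing at a directory holding 100s of CSV files.
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS staging_csv (
    |  id STRING, record STRING, datePartition STRING, idPartition STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |STORED AS TEXTFILE LOCATION '/staging/csv'""".stripMargin)

sqlContext.sql("SET hive.exec.dynamic.partition = true")
sqlContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// One insert/select moves whatever is currently in the staging directory into the ORC table.
sqlContext.sql(
  """INSERT INTO TABLE records PARTITION (datePartition, idPartition)
    |SELECT id, record, datePartition, idPartition FROM staging_csv""".stripMargin)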

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
But how do I take 100 partitions at a time from the staging table? On Sun, May 22, 2016 at 11:26 AM, Mich Talebzadeh wrote: > ok so you still keep data as ORC in Hive for further analysis > > what I have in mind is to have an external table as staging table and do >

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread swetha kasireddy
I am looking at ORC. I insert the data using the following query. sqlContext.sql(" CREATE EXTERNAL TABLE IF NOT EXISTS records (id STRING, record STRING) PARTITIONED BY (datePartition STRING, idPartition STRING) stored as ORC LOCATION '/user/users' ") sqlContext.sql(" orc.compress=

Re: Hive 2 Metastore Entity-Relationship Diagram, Base tables

2016-05-22 Thread Mich Talebzadeh
For now, to be used as a quick reference for Hive metadata tables, columns, PKs and constraints. It only covers the base tables, excluding the transactional add-ons in hive-txn-schema-2.0.0.oracle.sql. HTH

Re: Handling Empty RDD

2016-05-22 Thread Ted Yu
You mean when rdd.isEmpty() returned false, saveAsTextFile still produced an empty file? Can you show a code snippet that demonstrates this? Cheers On Sun, May 22, 2016 at 5:17 AM, Yogesh Vyas wrote: > Hi, > I am reading files using textFileStream, performing some action

Handling Empty RDD

2016-05-22 Thread Yogesh Vyas
Hi, I am reading files using textFileStream, performing some action on them and then saving the result to HDFS using saveAsTextFile. But whenever there is no file to read, Spark writes an empty RDD ([]) to HDFS. So how do I handle the empty RDD? I checked rdd.isEmpty() and rdd.count>0, but both of
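For reference, a minimal sketch of guarding the save with isEmpty inside foreachRDD, which avoids writing empty output directories; the paths and the transformation are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GuardedSave {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("guarded-save"), Seconds(30))
    val lines = ssc.textFileStream("hdfs:///data/incoming")

    lines.map(_.toUpperCase).foreachRDD { (rdd, time) =>
      // Skip the write entirely when the micro-batch produced nothing,
      // so no empty part files end up on HDFS.
      if (!rdd.isEmpty()) {
        rdd.saveAsTextFile(s"hdfs:///data/out/batch-${time.milliseconds}")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}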

How to change Spark DataFrame groupby("col1",..,"coln") into reduceByKey()?

2016-05-22 Thread unk1102
Hi, I have a Spark job which does a group by, and I can't avoid it because of my use case. I have a large dataset, around 1 TB, which I need to process/update in a DataFrame. Now my job shuffles huge amounts of data and slows down because of the shuffling and the group by. One reason I see is that my data is skewed; some of my group
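A hedged sketch of the two shapes this can take, with invented column names (df is assumed to have string columns col1 and col2 and a Long column value). Note that for built-in aggregates, DataFrame groupBy already does partial aggregation on the map side, so the RDD reduceByKey route mainly pays off when replacing groupByKey or a custom aggregation:

import org.apache.spark.sql.Row

// DataFrame version: groupBy with a built-in aggregate (partially aggregated before the shuffle).
val summed = df.groupBy("col1", "col2").sum("value")

// RDD version: reduceByKey combines values locally per partition before shuffling,
// unlike groupByKey, which ships every record for a key to one reducer.
val reduced = df.rdd
  .map { case Row(c1: String, c2: String, v: Long) => ((c1, c2), v) }
  .reduceByKey(_ + _)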

Unsubscribe

2016-05-22 Thread Shekhar Kumar
Please Unsubscribe - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

How to map values read from text file to 2 different set of RDDs

2016-05-22 Thread Deepak Sharma
Hi, I am reading a text file with 16 fields. All the placeholders for the values of this text file have been defined in, say, 2 different case classes: Case1 and Case2. How do I map the values read from the text file so that my function in Scala can return 2 different RDDs, with each RDD of
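A small sketch of one way to do this, assuming the 16 fields are comma separated and that the first 8 map to Case1 and the last 8 to Case2; the field split and the names are invented:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

case class Case1(f1: String, f2: String, f3: String, f4: String,
                 f5: String, f6: String, f7: String, f8: String)
case class Case2(f9: String, f10: String, f11: String, f12: String,
                 f13: String, f14: String, f15: String, f16: String)

def splitIntoRdds(sc: SparkContext, path: String): (RDD[Case1], RDD[Case2]) = {
  // Parse once, keep only well-formed lines, and reuse the parsed RDD for both outputs.
  val fields = sc.textFile(path).map(_.split(",", -1)).filter(_.length == 16)
  fields.cache()

  val rdd1 = fields.map(a => Case1(a(0), a(1), a(2), a(3), a(4), a(5), a(6), a(7)))
  val rdd2 = fields.map(a => Case2(a(8), a(9), a(10), a(11), a(12), a(13), a(14), a(15)))
  (rdd1, rdd2)
}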

Re: unsubscribe

2016-05-22 Thread junius zhou
unsubscribe On Tue, May 17, 2016 at 5:57 PM, aruna jakhmola wrote: > >

Re: How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread Mich Talebzadeh
Where is your base table and what format is it (Parquet, ORC, etc.)?

How to insert data for 100 partitions at a time using Spark SQL

2016-05-22 Thread SRK
Hi, In my Spark SQL query to insert data, I have around 14,000 partitions of data which seems to be causing memory issues. How can I insert the data for 100 partitions at a time to avoid any memory issues?

Re: What / Where / When / How questions in Spark 2.0 ?

2016-05-22 Thread Amit Sela
I need to update this ;) To start with, you could just take a look at branch-2.0. On Sun, May 22, 2016, 01:23 Ovidiu-Cristian MARCU < ovidiu-cristian.ma...@inria.fr> wrote: > Thank you, Amit! I was looking for this kind of information. > > I did not fully read your paper, I see in it a TODO with