Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
Is this technique similar to what Kinesis is offering or what Structured 
Streaming is going to have eventually?

Just curious.

Cheers,
Ben

 
> On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
> <vincent.gromakow...@gmail.com> wrote:
> 
> I would suggest coding your own Spark thriftserver, which seems to be very 
> easy.
> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>  
> <http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server>
> 
> I am starting to test it. The big advantage is that you can implement any 
> logic, because it's a Spark job, and then start a thrift server on temporary 
> tables. For example, you can query a micro-batch RDD from a Kafka stream, or 
> preload some tables and implement a rolling cache to periodically update the 
> Spark in-memory tables from the persistent store...
> It's not part of the public API and I don't know yet what the issues with 
> doing this are, but I think the Spark community should look at this path: making 
> the thriftserver instantiable in any Spark job.
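
A minimal PySpark sketch of the pattern described above: periodically reload data into Spark's cache, republish it as a temp view, and serve it from a Thrift server started inside the job. HiveThriftServer2.startWithContext is a JVM-only API, so the py4j call below goes through PySpark internals (_jvm, _jsparkSession) and should be treated as an unsupported workaround; the path, view name, and refresh interval are made up, and the hive-thriftserver classes must be on the driver classpath.

import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("in-job-thriftserver")      # hypothetical app name
         .enableHiveSupport()
         .getOrCreate())

_current = {}  # previously cached DataFrame per view, so old cached copies can be freed

def reload_view(path, view_name):
    """Load the latest data, cache it, and (re)publish it as a temp view."""
    df = spark.read.parquet(path)
    df.cache()
    df.count()                                # force materialization before swapping the view
    df.createOrReplaceTempView(view_name)
    previous = _current.get(view_name)
    if previous is not None:
        previous.unpersist()                  # drop the superseded cached copy
    _current[view_name] = df

# Initial load, then expose this session's tables over JDBC/ODBC.
reload_view("s3a://my-bucket/events/", "events_cached")   # placeholder path and view name

# Reaching the Scala-only Thrift server API through py4j internals (unsupported workaround).
jvm = spark.sparkContext._jvm
jvm.org.apache.spark.sql.hive.thriftserver.HiveThriftServer2.startWithContext(
    spark._jsparkSession.sqlContext())

# Rolling cache: refresh the in-memory table every 5 minutes (interval is arbitrary).
while True:
    time.sleep(300)
    reload_view("s3a://my-bucket/events/", "events_cached")
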
> 
> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com 
> <mailto:msegel_had...@hotmail.com>>:
> Guys, 
> Sorry for jumping in late to the game… 
> 
> If memory serves (which may not be a good thing…) :
> 
> You can use HiveServer2 as a connection point to HBase.  
> While this doesn’t perform well, it’s probably the cleanest solution. 
> I’m not keen on Phoenix… wouldn’t recommend it…. 
> 
> 
> The issue is that you’re trying to make HBase, a key/value object store, a 
> Relational Engine… it’s not. 
> 
> There are some considerations which make HBase not ideal for all use cases 
> and you may find better performance with Parquet files. 
> 
> One thing missing is the use of secondary indexing and the query optimizations 
> that you have in RDBMSs but that are lacking in HBase / MapR-DB / etc. … so your 
> performance will vary. 
> 
> With respect to Tableau… their entire interface into the big data world 
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
> part of your solution, you’re DOA with respect to Tableau. 
> 
> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
> Apache project) 
> 
> 
>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>> Thanks for all the suggestions. It would seem you guys are right about the 
>> Tableau side of things. The reports don’t need to be real-time, and they 
>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>> 
>> I originally thought that we needed two-way data retrieval from the DMP 
>> HBase for ID generation, but after further investigation into the use-case 
>> and architecture, the ID generation needs to happen local to the Ad Servers 
>> where we generate a unique ID and store it in an ID linking table. Even 
>> better, many of the 3rd party services supply this ID. So, data only needs 
>> to flow in one direction. We will use Kafka as the bus for this. No JDBC 
>> required. This also goes for the REST endpoints. 3rd party services will 
>> hit ours to update our data with no need to read from our data. And, when we 
>> want to update their data, we will hit theirs to update their data using a 
>> triggered job.
>> 
>> This all boils down to just integrating with Kafka.
>> 
>> Once again, thanks for all the help.
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Oct 9, 2016, at 3:16 AM, Jörn Franke <jornfra...@gmail.com 
>>> <mailto:jornfra...@gmail.com>> wrote:
>>> 
>>> Please also keep in mind that Tableau Server has the capability to store 
>>> data in-memory and refresh the in-memory data only when needed. This means 
>>> you can import it from any source and let your users work only on the 
>>> in-memory data in Tableau Server.
>>> 
>>> On Sun, Oct 9, 2016 at 9:22 AM, Jörn Franke <jornfra...@gmail.com 
>>> <mailto:jornfra...@gmail.com>> wrote:
>>> Cloudera 5.8 has a very old version of Hive without Tez, but Mich already 
>>> provided a good alternative. However, you should check if it contains a 
>>> recent version of HBase and Phoenix. That being said, I just wonder what 
>>> the dataflow, data model, and analysis you plan to do are. Maybe there are 
>>> completely different solutions possible. Especially, single inserts, 
>>> upserts, etc. should be avoided

Re: Deep learning libraries for scala

2016-10-19 Thread Benjamin Kim
On that note, here is an article that Databricks made regarding using 
Tensorflow in conjunction with Spark.

https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
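
The broad pattern in that post is to keep TensorFlow itself single-node and use Spark to fan out independent work such as hyperparameter search. A rough PySpark sketch of that idea follows; the tiny linear model, parameter grid, and TF 1.x graph-mode calls are illustrative only (not taken from the article), and TensorFlow and NumPy need to be installed on the executors.

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tf-grid-search").getOrCreate()
sc = spark.sparkContext

def train_and_score(learning_rate):
    """Runs entirely on one executor: fit y = 2x with plain gradient descent in TF 1.x."""
    import tensorflow as tf               # imported on the worker
    x = np.arange(10, dtype=np.float32)
    y = 2.0 * x
    w = tf.Variable(0.0)
    loss = tf.reduce_mean(tf.square(w * tf.constant(x) - tf.constant(y)))
    step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(200):
            sess.run(step)
        return learning_rate, float(sess.run(loss))

# Each learning rate is evaluated independently in its own Spark task.
grid = [0.0001, 0.001, 0.01]
results = sc.parallelize(grid, len(grid)).map(train_and_score).collect()
print(sorted(results, key=lambda r: r[1]))
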

Cheers,
Ben


> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta  
> wrote:
> 
> While using deep learning you might want to stay as close to TensorFlow as 
> possible. There is very little translation loss, you get access to stable, 
> scalable, and tested libraries from the best brains in the industry, and as far 
> as Scala goes, it helps a lot to think about using the language as a tool to 
> access algorithms in this instance, unless you want to start developing 
> algorithms from the ground up (in which case you might not require any 
> libraries at all).
> 
> On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty wrote:
> Hi,
> 
> Are there any good libraries which can be used for Scala deep learning 
> models?
> How can we integrate TensorFlow with Scala ML?
> 



Re: Deep learning libraries for scala

2016-11-01 Thread Benjamin Kim
To add, I see that Databricks has been busy integrating deep learning more into 
their product and put out a new article about this.

https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html 
<https://databricks.com/blog/2016/10/27/gpu-acceleration-in-databricks.html>

An interesting tidbit at the bottom of the article mentions TensorFrames.

https://github.com/databricks/tensorframes 
<https://github.com/databricks/tensorframes>

Seems like an interesting direction…

Cheers,
Ben


> On Oct 19, 2016, at 9:05 AM, janardhan shetty <janardhan...@gmail.com> wrote:
> 
> Agreed. But as it states, deeper integration with Scala is yet to be 
> developed. 
> Any thoughts on how to use TensorFlow with Scala? Need to write wrappers, I 
> think.
> 
> 
> On Oct 19, 2016 7:56 AM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> On that note, here is an article that Databricks made regarding using 
> Tensorflow in conjunction with Spark.
> 
> https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
>  
> <https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html>
> 
> Cheers,
> Ben
> 
> 
>> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta <gourav.sengu...@gmail.com 
>> <mailto:gourav.sengu...@gmail.com>> wrote:
>> 
>> While using deep learning you might want to stay as close to TensorFlow as 
>> possible. There is very little translation loss, you get access to stable, 
>> scalable, and tested libraries from the best brains in the industry, and as 
>> far as Scala goes, it helps a lot to think about using the language as a 
>> tool to access algorithms in this instance, unless you want to start 
>> developing algorithms from the ground up (in which case you might not 
>> require any libraries at all).
>> 
>> On Sat, Oct 1, 2016 at 3:30 AM, janardhan shetty <janardhan...@gmail.com 
>> <mailto:janardhan...@gmail.com>> wrote:
>> Hi,
>> 
>> Are there any good libraries which can be used for Scala deep learning 
>> models?
>> How can we integrate TensorFlow with Scala ML?
>> 
> 



Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
This will give me an opportunity to start using Structured Streaming. Then, I 
can try adding more functionality. If all goes well, then we could transition 
off of HBase to a more in-memory data solution that can “spill-over” data for 
us.
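
A minimal PySpark sketch of one way to combine the two ideas: stream from Kafka into a named in-memory table that the same job can query (or expose over JDBC, as Vincent describes below). The broker, topic, and table name are placeholders, the memory sink is meant for small result sets, and the job needs the spark-sql-kafka package on its classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("stream-to-memory-table").getOrCreate()

# Read a Kafka topic as a streaming DataFrame (placeholder broker and topic).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "events")
          .load()
          .select(col("key").cast("string"), col("value").cast("string"), "timestamp"))

# The "memory" sink keeps results in an in-memory table named "events_live",
# queryable with spark.sql(...) from the same application (empty until the first micro-batch).
query = (events.writeStream
         .format("memory")
         .queryName("events_live")
         .outputMode("append")
         .start())

spark.sql("SELECT count(*) FROM events_live").show()
query.awaitTermination()
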

> On Oct 17, 2016, at 11:53 AM, vincent gromakowski 
> <vincent.gromakow...@gmail.com> wrote:
> 
> Instead of (or in addition to) saving results somewhere, you just start a 
> thriftserver that exposes the Spark tables of the SQLContext (or SparkSession 
> now). That means you can implement any logic (and maybe use structured 
> streaming) to expose your data. Today, using the thriftserver means reading 
> data from the persistent store on every query, so if the data modeling doesn't 
> fit the query it can be quite slow. What you generally do in a common Spark 
> job is to load the data and cache the Spark table in an in-memory columnar format, 
> which is quite efficient for any kind of query; the counterpart is that the 
> cache isn't updated, so you have to implement a reload mechanism, and this 
> solution isn't available using the thriftserver.
> What I propose is to mix the two worlds: periodically/delta load data into the Spark 
> table cache and expose it through the thriftserver. But you have to implement 
> the loading logic; it can be very simple to very complex depending on your 
> needs.
> 
> 
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>>:
> Is this technique similar to what Kinesis is offering or what Structured 
> Streaming is going to have eventually?
> 
> Just curious.
> 
> Cheers,
> Ben
> 
>  
>> On Oct 17, 2016, at 10:14 AM, vincent gromakowski 
>> <vincent.gromakow...@gmail.com <mailto:vincent.gromakow...@gmail.com>> wrote:
>> 
>> I would suggest coding your own Spark thriftserver, which seems to be very 
>> easy.
>> http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server
>>  
>> <http://stackoverflow.com/questions/27108863/accessing-spark-sql-rdd-tables-through-the-thrift-server>
>> 
>> I am starting to test it. The big advantage is that you can implement any 
>> logic, because it's a Spark job, and then start a thrift server on temporary 
>> tables. For example, you can query a micro-batch RDD from a Kafka stream, or 
>> preload some tables and implement a rolling cache to periodically update 
>> the Spark in-memory tables from the persistent store...
>> It's not part of the public API and I don't know yet what the issues with 
>> doing this are, but I think the Spark community should look at this path: making 
>> the thriftserver instantiable in any Spark job.
>> 
>> 2016-10-17 18:17 GMT+02:00 Michael Segel <msegel_had...@hotmail.com 
>> <mailto:msegel_had...@hotmail.com>>:
>> Guys, 
>> Sorry for jumping in late to the game… 
>> 
>> If memory serves (which may not be a good thing…) :
>> 
>> You can use HiveServer2 as a connection point to HBase.  
>> While this doesn’t perform well, it’s probably the cleanest solution. 
>> I’m not keen on Phoenix… wouldn’t recommend it…. 
>> 
>> 
>> The issue is that you’re trying to make HBase, a key/value object store, a 
>> Relational Engine… it’s not. 
>> 
>> There are some considerations which make HBase not ideal for all use cases 
>> and you may find better performance with Parquet files. 
>> 
>> One thing missing is the use of secondary indexing and the query optimizations 
>> that you have in RDBMSs but that are lacking in HBase / MapR-DB / etc. … so your 
>> performance will vary. 
>> 
>> With respect to Tableau… their entire interface into the big data world 
>> revolves around the JDBC/ODBC interface. So if you don’t have that piece as 
>> part of your solution, you’re DOA with respect to Tableau. 
>> 
>> Have you considered Drill as your JDBC connection point?  (YAAP: Yet another 
>> Apache project) 
>> 
>> 
>>> On Oct 9, 2016, at 12:23 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Thanks for all the suggestions. It would seem you guys are right about the 
>>> Tableau side of things. The reports don’t need to be real-time, and they 
>>> won’t be directly feeding off of the main DMP HBase data. Instead, it’ll be 
>>> batched to Parquet or Kudu/Impala or even PostgreSQL.
>>> 
>>> I originally thought that we needed two-way data retrieval from the DMP 
>>> HBase for ID generation, but after further investigation into the use-case 
>>> and architecture, the ID generation needs to happen 

Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9

On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Hi Benjamin,


As you might already know, I believe the Hadoop command does not properly
merge column-based formats such as ORC or Parquet but just simply
concatenates them.

I haven't tried this by myself but I remember I saw a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merging small Parquet files into one.


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com>:

Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
them into 1 file after they are output from Spark. Doing a coalesce(1) on
the Spark cluster will not work. It just does not have the resources to do
it. I'm trying to do it using the command line and not use Spark. I will use
this command in a shell script. I tried "hdfs dfs -getmerge", but the file
becomes unreadable by Spark with a gzip footer error.





Thanks,


Ben




Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9 based on the jira. If that doesn’t 
work, I’ll try Kite.

Cheers,
Ben


> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> 
> As you might already know, I believe the Hadoop command does not properly 
> merge column-based formats such as ORC or Parquet but just simply 
> concatenates them.
> 
> I haven't tried this by myself but I remember I saw a JIRA in Parquet - 
> https://issues.apache.org/jira/browse/PARQUET-460 
> <https://issues.apache.org/jira/browse/PARQUET-460>
> 
> It seems parquet-tools allows merging small Parquet files into one. 
> 
> 
> Also, I believe there are command-line tools in Kite - 
> https://github.com/kite-sdk/kite <https://github.com/kite-sdk/kite>
> 
> This might be useful.
> 
> 
> Thanks!
> 
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>>:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
> into 1 file after they are output from Spark. Doing a coalesce(1) on the 
> Spark cluster will not work. It just does not have the resources to do it. 
> I'm trying to do it using the command line and not use Spark. I will use this 
> command in a shell script. I tried "hdfs dfs -getmerge", but the file becomes 
> unreadable by Spark with a gzip footer error.
> 
> Thanks,
> Ben
> 
> 



Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them 
into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark 
cluster will not work. It just does not have the resources to do it. I'm trying 
to do it using the command line and not use Spark. I will use this command in a 
shell script. I tried "hdfs dfs -getmerge", but the file becomes unreadable by 
Spark with a gzip footer error.
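
One non-Spark alternative to getmerge, sketched below with pyarrow, under the assumption that the part files have been copied to a local directory and fit in memory on one machine (paths are placeholders). Unlike simple concatenation, this rewrites a single valid Parquet file that Spark can read.

import pyarrow.parquet as pq

# Read every part file in the directory as one logical dataset...
dataset = pq.ParquetDataset("/tmp/parts/")          # local copy of the *.gz.parquet parts
table = dataset.read()

# ...and write it back out as a single gzip-compressed Parquet file.
pq.write_table(table, "/tmp/merged.gz.parquet", compression="gzip")
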

Thanks,
Ben



Spark 2.1 and Hive Metastore

2017-04-09 Thread Benjamin Kim
I’m curious about if and when Spark SQL will ever remove its dependency on the Hive 
Metastore. Now that Spark 2.1’s SparkSession has superseded the need for 
HiveContext, are there plans for Spark to replace the Hive Metastore 
service with a “SparkSchema” service backed by a PostgreSQL, MySQL, etc. DB? 
Hive is growing long in the tooth, and it would be nice to retire it someday.
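
For reference, the dependency in question is the metastore service a Hive-enabled SparkSession points at today (its own backing DB can be PostgreSQL, MySQL, etc.). A minimal sketch, with a placeholder host:

from pyspark.sql import SparkSession

# Spark SQL still delegates persistent table metadata to the Hive Metastore service.
spark = (SparkSession.builder
         .appName("metastore-example")
         .config("hive.metastore.uris", "thrift://metastore-host:9083")  # placeholder host
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()   # catalog calls go through the metastore
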

Cheers,
Ben



Glue-like Functionality

2017-07-08 Thread Benjamin Kim
Has anyone seen AWS Glue? I was wondering if something similar is going 
to be built into Spark Structured Streaming? I like the Data Catalog idea to 
store and track any data source/destination. It profiles the data to derive the 
schema and data types. Also, it does some sort of automated schema evolution 
when or if the schema changes. It leaves only the transformation logic to the 
ETL developer. I think some of this can enhance or simplify Structured 
Streaming. For example, AWS S3 can be catalogued as a Data Source; in 
Structured Streaming, an Input DataFrame is created like a SQL view based off of 
the S3 Data Source; lastly, the Transform logic, if any, just manipulates the 
data going from the Input DataFrame to the Result DataFrame, which is another 
view based off of a catalogued Data Destination. This would relieve the ETL 
developer from caring about any Data Source or Destination. All server 
information, access credentials, data schemas, folder directory structures, 
file formats, and any other properties can be securely stored away with only a 
select few.
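
A rough Structured Streaming sketch of that flow, with the catalog lookups reduced to hard-coded placeholders (the S3 paths, schema, and checkpoint location are all made up); the point is that only the middle transform step is pipeline-specific.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("glue-like-pipeline").getOrCreate()

# "Data Source" piece: in a catalog, this schema/path/format would be looked up, not hard-coded.
source_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("ts", LongType()),
])
input_df = (spark.readStream
            .schema(source_schema)
            .json("s3a://example-bucket/incoming/"))      # placeholder source

# Transform step: the only part the ETL developer would actually write.
result_df = input_df.where(col("event") == "click").select("user_id", "ts")

# "Data Destination" piece: again, path/format/checkpoint would come from the catalog.
(result_df.writeStream
 .format("parquet")
 .option("path", "s3a://example-bucket/clicks/")          # placeholder destination
 .option("checkpointLocation", "s3a://example-bucket/checkpoints/clicks/")
 .outputMode("append")
 .start()
 .awaitTermination())
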

I'm just curious to know if anyone has thought the same thing.

Cheers,
Ben



Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Benjamin Kim
Hi Bo,

+1 for your project. I come from the world of data warehouses, ETL, and 
reporting analytics. There are many individuals who do not know or want to do 
any coding. They are content with ANSI SQL and stick to it. ETL workflows are 
also done without any coding using a drag-and-drop user interface, such as 
Talend, SSIS, etc. There is a small amount of scripting involved but not too 
much. I looked at what you are trying to do, and I welcome it. This could open 
up Spark to the masses and shorten development times.

Cheers,
Ben


> On Jun 12, 2017, at 10:14 PM, bo yang  wrote:
> 
> Hi Aakash,
> 
> Thanks for your willingness to help :) It will be great if I could get more 
> feedback on my project. For example, are there any other people feeling the 
> need for using a script to write Spark jobs easily? Also, I would explore 
> whether it is possible for the Spark project to take on some work to build such a 
> script-based high-level DSL.
> 
> Best,
> Bo
> 
> 
> On Mon, Jun 12, 2017 at 12:14 PM, Aakash Basu wrote:
> Hey,
> 
> I work on Spark SQL and would pretty much be able to help you in this. Let me 
> know your requirement.
> 
> Thanks,
> Aakash.
> 
> On 12-Jun-2017 11:00 AM, "bo yang" wrote:
> Hi Guys,
> 
> I am writing a small open source project 
>  to use SQL Script to write Spark 
> Jobs. Want to see if there are other people interested in using or contributing 
> to this project.
> 
> The project is called UberScriptQuery 
> (https://github.com/uber/uberscriptquery). Sorry for the dumb name, chosen to avoid 
> conflict with many other names (Spark is a registered trademark, thus I could 
> not use Spark in my project name).
> 
> In short, it is a high level SQL-like DSL (Domain Specific Language) on top 
> of Spark. People can use that DSL to write Spark jobs without worrying about 
> Spark internal details. Please check the README 
> in the project to get more details.
> 
> It will be great if I could get any feedback or suggestions!
> 
> Best,
> Bo
> 
> 



Serverless ETL

2017-10-17 Thread Benjamin Kim
With AWS having Glue and GCE having Dataprep, is Databricks coming out with
an equivalent or better? I know that Serverless is a new offering, but will
it go further with automatic data schema discovery, profiling, metadata
storage, change triggering, joining, transform suggestions, etc.?

Just curious.

Cheers,
Ben


Re: Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
To add, we have a CDH 5.12 cluster with Spark 2.2 in our data center.

On Mon, Nov 13, 2017 at 3:15 PM Benjamin Kim <bbuil...@gmail.com> wrote:

> Does anyone know if there is a connector for AWS Kinesis that can be used
> as a source for Structured Streaming?
>
> Thanks.
>
>


Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
Does anyone know if there is a connector for AWS Kinesis that can be used
as a source for Structured Streaming?

Thanks.


Databricks Serverless

2017-11-13 Thread Benjamin Kim
I have a question about this. The documentation compares the concept
to BigQuery. Does this mean that we will no longer need to deal
with instances and just pay for execution duration and amount of data
processed? I’m just curious about how this will be priced.

Also, when will it be ready for production?

Cheers.


Re: Append In-Place to S3

2018-06-07 Thread Benjamin Kim
I tried a different tactic. I still append based on the query below, but I add 
another deduping step afterwards, writing to a staging directory then 
overwriting back. Luckily, the data is small enough for this to happen fast.
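
A sketch of that tactic in PySpark, reusing the spark session and path variables from the quoted snippet below and assuming the four ID columns form the dedup key (the staging path is a placeholder). It expresses the anti join with the DataFrame API, which sidesteps the unclosed CTE in the original SQL; note that the original query's last predicate compares i.target to im.target_id, which may itself be part of the duplicate problem.

key_cols = ["source", "source_id", "target", "target_id"]

existing_df = spark.read.parquet(existing_data_path)
new_df = spark.read.parquet(new_data_path)

# Keep only rows whose composite key is not already present, then append them.
append_df = new_df.select(key_cols).distinct().join(existing_df, on=key_cols, how="left_anti")
append_df.coalesce(1).write.parquet(existing_data_path, mode="append", compression="gzip")

# Extra dedupe pass: rewrite the whole (small) data set via a staging directory.
staging_path = "s3://my-bucket/staging/ids/"   # placeholder
deduped = spark.read.parquet(existing_data_path).dropDuplicates(key_cols)
deduped.write.parquet(staging_path, mode="overwrite", compression="gzip")
spark.read.parquet(staging_path).write.parquet(existing_data_path, mode="overwrite", compression="gzip")
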

Cheers,
Ben

> On Jun 3, 2018, at 3:02 PM, Tayler Lawrence Jones  
> wrote:
> 
> Sorry actually my last message is not true for anti join, I was thinking of 
> semi join. 
> 
> -TJ
> 
> On Sun, Jun 3, 2018 at 14:57 Tayler Lawrence Jones <t.jonesd...@gmail.com> wrote:
> A left join with null filter is only the same as a left anti join if the join 
> keys can be guaranteed unique in the existing data. Since hive tables on s3 
> offer no unique guarantees outside of your processing code, I recommend using 
> left anti join over left join + null filter.
> 
> -TJ
> 
> On Sun, Jun 3, 2018 at 14:47 ayan guha <guha.a...@gmail.com> wrote:
> I do not use anti join semantics, but you can use a left outer join and then 
> filter out nulls from the right side. Your data may have dups on the columns 
> separately, but it should not have dups on the composite key, i.e., all columns 
> put together.
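
For completeness, a short PySpark sketch of this left-outer-plus-null-filter variant, assuming new_df holds the new rows and existing_df the existing ones (in the snippet quoted further down, the existing data is just called df); as noted elsewhere in the thread, it only matches a left anti join when the composite key is unique in the existing data.

from pyspark.sql.functions import col, lit

key_cols = ["source", "source_id", "target", "target_id"]

# Mark every existing composite key, left-outer join, then keep only unmatched rows.
marked = existing_df.select(key_cols).distinct().withColumn("_matched", lit(1))
only_new = (new_df.join(marked, on=key_cols, how="left_outer")
            .where(col("_matched").isNull())
            .drop("_matched"))
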
> 
> On Mon, 4 Jun 2018 at 6:42 am, Tayler Lawrence Jones <t.jonesd...@gmail.com> wrote:
> The issue is not the append vs overwrite - perhaps those responders do not 
> know anti join semantics. Further, overwrite on S3 is a bad pattern due to S3 
> eventual consistency issues. 
> 
> First, your SQL query is wrong, as you don't close the parenthesis of the CTE 
> ("with" part). In fact, it looks like you don't need that WITH at all, and 
> the query should fail to parse. If it does parse, I would open a bug on the 
> Spark JIRA.
> 
> Can you provide the query that you are using to detect duplication so I can 
> see if your deduplication logic matches the detection query? 
> 
> -TJ
> 
> On Sat, Jun 2, 2018 at 10:22 Aakash Basu <aakash.spark@gmail.com> wrote:
> As Jay suggested correctly, if you're joining then overwrite otherwise only 
> append as it removes dups.
> 
> I think, in this scenario, just change it to write.mode('overwrite') because 
> you're already reading the old data and your job would be done.
> 
> 
> On Sat 2 Jun, 2018, 10:27 PM Benjamin Kim <bbuil...@gmail.com> wrote:
> Hi Jay,
> 
> Thanks for your response. Are you saying to append the new data and then 
> remove the duplicates from the whole data set afterwards, overwriting the 
> existing data set with the new data set with the appended values? I will give 
> that a try. 
> 
> Cheers,
> Ben
> 
> On Fri, Jun 1, 2018 at 11:49 PM Jay <jayadeep.jayara...@gmail.com> wrote:
> Benjamin,
> 
> The append will append the "new" data to the existing data without removing the 
> duplicates. You would need to overwrite the file every time if you need unique 
> values.
> 
> Thanks,
> Jayadeep
> 
> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim <bbuil...@gmail.com> wrote:
> I have a situation where I'm trying to add only new rows to an existing data 
> set that lives in S3 as gzipped parquet files, looping and appending for each 
> hour of the day. First, I create a DF from the existing data, then I use a 
> query to create another DF with the data that is new. Here is the code 
> snippet.
> 
> df = spark.read.parquet(existing_data_path)
> df.createOrReplaceTempView(‘existing_data’)
> new_df = spark.read.parquet(new_data_path)
> new_df.createOrReplaceTempView(’new_data’)
> append_df = spark.sql(
> """
> WITH ids AS (
> SELECT DISTINCT
> source,
> source_id,
> target,
> target_id
> FROM new_data i
> LEFT ANTI JOIN existing_data im
> ON i.source = im.source
> AND i.source_id = im.source_id
> AND i.target = im.target
> AND i.target = im.target_id
> """
> )
> append_df.coalesce(1).write.parquet(existing_data_path, mode='append', 
> compression='gzip’)
> 
> I thought this would append new rows and keep the data unique, but I am see 
> many duplicates. Can someone help me with this and tell me what I am doing 
> wrong?
> 
> Thanks,
> Ben
> -- 
> Best Regards,
> Ayan Guha



Re: Append In-Place to S3

2018-06-02 Thread Benjamin Kim
Hi Jay,

Thanks for your response. Are you saying to append the new data and then
remove the duplicates from the whole data set afterwards, overwriting the
existing data set with the new data set with the appended values? I will give
that a try.

Cheers,
Ben

On Fri, Jun 1, 2018 at 11:49 PM Jay  wrote:

> Benjamin,
>
> The append will append the "new" data to the existing data without removing
> the duplicates. You would need to overwrite the file every time if you need
> unique values.
>
> Thanks,
> Jayadeep
>
> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim  wrote:
>
>> I have a situation where I'm trying to add only new rows to an existing
>> data set that lives in S3 as gzipped parquet files, looping and appending
>> for each hour of the day. First, I create a DF from the existing data, then
>> I use a query to create another DF with the data that is new. Here is the
>> code snippet.
>>
>> df = spark.read.parquet(existing_data_path)
>> df.createOrReplaceTempView(‘existing_data’)
>> new_df = spark.read.parquet(new_data_path)
>> new_df.createOrReplaceTempView(’new_data’)
>> append_df = spark.sql(
>> """
>> WITH ids AS (
>> SELECT DISTINCT
>> source,
>> source_id,
>> target,
>> target_id
>> FROM new_data i
>> LEFT ANTI JOIN existing_data im
>> ON i.source = im.source
>> AND i.source_id = im.source_id
>> AND i.target = im.target
>> AND i.target = im.target_id
>> """
>> )
>> append_df.coalesce(1).write.parquet(existing_data_path, mode='append',
>> compression='gzip’)
>>
>>
>> I thought this would append new rows and keep the data unique, but I am
>> see many duplicates. Can someone help me with this and tell me what I am
>> doing wrong?
>>
>> Thanks,
>> Ben
>>
>


Append In-Place to S3

2018-06-01 Thread Benjamin Kim
I have a situation where I'm trying to add only new rows to an existing data set 
that lives in S3 as gzipped parquet files, looping and appending for each hour 
of the day. First, I create a DF from the existing data, then I use a query to 
create another DF with the data that is new. Here is the code snippet.

df = spark.read.parquet(existing_data_path)
df.createOrReplaceTempView(‘existing_data’)
new_df = spark.read.parquet(new_data_path)
new_df.createOrReplaceTempView(’new_data’)
append_df = spark.sql(
"""
WITH ids AS (
SELECT DISTINCT
source,
source_id,
target,
target_id
FROM new_data i
LEFT ANTI JOIN existing_data im
ON i.source = im.source
AND i.source_id = im.source_id
AND i.target = im.target
AND i.target = im.target_id
"""
)
append_df.coalesce(1).write.parquet(existing_data_path, mode='append', 
compression='gzip’)

I thought this would append new rows and keep the data unique, but I am see 
many duplicates. Can someone help me with this and tell me what I am doing 
wrong?

Thanks,
Ben

<    1   2