Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
…With respect to Tableau, their entire interface into the big data world revolves around the JDBC/ODBC interface. So if you don’t have that piece as part of your solution, you’re DOA with respect to Tableau. Have you considered Drill as your JDBC connecti…

Re: Deep learning libraries for scala

2016-10-19 Thread Benjamin Kim
On that note, here is an article that Databricks wrote about using TensorFlow in conjunction with Spark: https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html Cheers, Ben. On Oct 19, 2016, at 3:09 AM, Gourav Sengupta…

Re: Deep learning libraries for scala

2016-11-01 Thread Benjamin Kim
…@gmail.com> wrote: Agreed. But, as it states, deeper integration with Scala is yet to be developed. Any thoughts on how to use TensorFlow with Scala? Wrappers would need to be written, I think. On Oct 19, 2016, 7:56 AM, "Benjamin Kim" <bbuil...@gmail.com…

Re: Spark SQL Thriftserver with HBase

2016-10-17 Thread Benjamin Kim
…load data into the Spark table cache and expose it through the Thrift server. But you have to implement the loading logic; it can be very simple to very complex depending on your needs. 2016-10-17 19:48 GMT+02:00 Benjamin Kim <bbuil...@gmail.com…

Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
…github.com/kite-sdk/kite This might be useful. Thanks! 2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com>: Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them into one file after they are output from Spark. Doing a coalesce(1) on the Spark cluster will not work; it just d…

Re: Merging Parquet Files

2016-12-22 Thread Benjamin Kim
…https://issues.apache.org/jira/browse/PARQUET-460 It seems parquet-tools allows merging small Parquet files into one. Also, I believe there are command-line tools in Kite: https://github.com/kite-sdk/kite This might…

Merging Parquet Files

2016-12-22 Thread Benjamin Kim
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them into one file after they are output from Spark. Doing a coalesce(1) on the Spark cluster will not work; it just does not have the resources to do it. I'm trying to do it using the command line and not use Spark. I will…

Spark 2.1 and Hive Metastore

2017-04-09 Thread Benjamin Kim
I’m curious about if and when Spark SQL will ever remove its dependency on the Hive Metastore. Now that Spark 2.1’s SparkSession has superseded the need for HiveContext, are there plans for Spark to replace the Hive Metastore service with its own “SparkSchema” service backed by PostgreSQL, MySQL, etc.…
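For context, Spark can already be pointed at a metastore whose backing database is PostgreSQL or MySQL, via the Hive configuration it picks up from hive-site.xml on its classpath. A minimal fragment might look like the following (host name, database name, and driver are placeholder assumptions):

```xml
<!-- hive-site.xml sketch: metastore backed by PostgreSQL.
     Host, port, and database name are hypothetical. -->
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastore-host:5432/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
</configuration>
```

This still goes through the Hive Metastore service and its schema, though, which is exactly the dependency the question asks about removing.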

Glue-like Functionality

2017-07-08 Thread Benjamin Kim
Has anyone seen AWS Glue? I was wondering if something similar is going to be built into Spark Structured Streaming. I like the Data Catalog idea to store and track any data source/destination. It profiles the data to derive the schema and data types. Also, it does some sort of automated…

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Benjamin Kim
Hi Bo, +1 for your project. I come from the world of data warehouses, ETL, and reporting analytics. There are many individuals who do not know or want to do any coding. They are content with ANSI SQL and stick to it. ETL workflows are also done without any coding using a drag-and-drop user…

Serverless ETL

2017-10-17 Thread Benjamin Kim
With AWS having Glue and GCP having Dataprep, is Databricks coming out with an equivalent or better? I know that Serverless is a new offering, but will it go further with automatic data schema discovery, profiling, metadata storage, change triggering, joining, transform suggestions, etc.? Just…

Re: Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
To add, we have a CDH 5.12 cluster with Spark 2.2 in our data center. On Mon, Nov 13, 2017 at 3:15 PM Benjamin Kim <bbuil...@gmail.com> wrote: Does anyone know if there is a connector for AWS Kinesis that can be used as a source for Structured Streaming? Thanks.

Spark 2.2 Structured Streaming + Kinesis

2017-11-13 Thread Benjamin Kim
Does anyone know if there is a connector for AWS Kinesis that can be used as a source for Structured Streaming? Thanks.

Databricks Serverless

2017-11-13 Thread Benjamin Kim
I have a question about this. The documentation compares the concept to BigQuery. Does this mean that we will no longer need to deal with instances and will just pay for execution duration and the amount of data processed? I’m just curious about how this will be priced. Also, when will it be ready…

Re: Append In-Place to S3

2018-06-07 Thread Benjamin Kim
…ted correctly: if you're joining, then overwrite; otherwise, only append, as it removes dups. I think, in this scenario, just change it to write.mode('overwrite') because you're already reading the old data and your job would be done. On Sat 2 Jun, 2018, 10:27 PM Be…

Re: Append In-Place to S3

2018-06-02 Thread Benjamin Kim
…Benjamin, the append will append the "new" data to the existing data without removing the duplicates. You would need to overwrite the file every time if you need unique values. Thanks, Jayadeep. On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote…

Append In-Place to S3

2018-06-01 Thread Benjamin Kim
I have a situation where I’m trying to add only new rows to an existing data set that lives in S3 as gzipped Parquet files, looping and appending for each hour of the day. First, I create a DF from the existing data, then I use a query to create another DF with the data that is new. Here is the…
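The intent described above, keeping only the incoming rows whose key is not already present, is essentially a left-anti join (in Spark, roughly new_df.join(old_df, "id", "left_anti") before appending). A plain-Python sketch of that logic, with a hypothetical key column name `id`:

```python
def only_new_rows(existing, incoming, key="id"):
    """Return the incoming rows whose key is absent from the existing
    rows -- a left-anti join of incoming against existing on `key`.
    Rows are modeled as dicts; the key name `id` is hypothetical."""
    seen = {row[key] for row in existing}
    return [row for row in incoming if row[key] not in seen]

# existing = [{"id": 1}, {"id": 2}]
# incoming = [{"id": 2}, {"id": 3}]
# only_new_rows(existing, incoming)  # -> [{"id": 3}]
```

Appending only this filtered set is what keeps the S3 data set duplicate-free, since (as the replies note) append mode itself does not deduplicate.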
