a new release
with this fix has not been cut yet.
Could you please provide me with a projected release date for the
vulnerability fix? I would greatly appreciate it!
Thank you,
Benjamin Liu
0, 2022 1:08 AM
To: sebastian@gmail.com
Cc: Benjamin Du ; u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice
Hi,
Without getting into suppositions, the best option is to look into the Spark UI's
SQL section.
It is the most wonderful tool to explain w
From: Deepak Sharma
Sent: Sunday, January 30, 2022 12:45 AM
To: Benjamin Du
Cc: u...@spark.incubator.apache.org
Subject: Re: A Persisted Spark DataFrame is computed twice
coalesce returns a new dataset.
That will cause the recomputation.
Thanks
Deepak
On Sun, 30 Jan 2022 at 14:06, Benjamin Du wrote:
e a DataFrame to disk, read it back,
repartition/coalesce it, and then write it back to HDFS.
spark.read.parquet("/input/hdfs/path") \
    .filter(col("n0") == n0) \
    .filter(col("n1") == n1) \
    .filter(col("h1") == h1) \
    .filter
I have some PySpark code like below. Basically, I persist a DataFrame (which is
time-consuming to compute) to disk, call the method DataFrame.count to trigger
the caching/persist immediately, and then I coalesce the DataFrame to reduce
the number of partitions (the original DataFrame has 30,000
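For reference, here is a minimal Scala sketch of the pattern described above (persist to disk, materialize with count, then coalesce and write). The paths, filter value, and partition count are hypothetical, and the thread above reports that coalesce can still trigger recomputation, with writing out and reading back as the workaround.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

object PersistThenCoalesce {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("persist-then-coalesce").getOrCreate()

    // Expensive-to-compute DataFrame, persisted to disk only (hypothetical path and filter).
    val df = spark.read.parquet("/input/hdfs/path")
      .filter(col("n0") === "v0")
      .persist(StorageLevel.DISK_ONLY)

    // Materialize the persisted data before reusing it.
    df.count()

    // Reduce the number of partitions and write the result. As discussed above,
    // this step may still recompute the lineage; writing the persisted data out
    // and reading it back before coalescing is the workaround mentioned in the thread.
    df.coalesce(300).write.mode("overwrite").parquet("/output/hdfs/path")

    spark.stop()
  }
}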
artifact is available via the web, please provide a download link.
In the event that your organization declines to complete the VPAT/ACR,
please notify me at the contact information provided below.
Very Respectfully,
Ben and team!
--
BENJAMIN MURPHY | *Technical Project Manager / Scrum Master*
GSA
From: Sean Owen
Sent: Monday, October 4, 2021 1:00 PM
To: Benjamin Du
Cc: user@spark.apache.org
Subject: Re: [RNG]: How does Spark handle RNGs?
The 2nd approach. Spark doesn't work in the 1st way in any context - the driver
and executor processes do not cooperate during execution.
Hi everyone,
I'd like to ask how Spark (or, more generally, distributed computing
engines) handles RNGs. At a high level, there are two ways:
1. Use a single RNG on the driver; random number generation on each worker
makes a request to the single RNG on the driver.
2. Use a separat
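To illustrate the second approach Sean describes above (independent generators on the executors, no driver coordination), here is a minimal Scala sketch; the base seed, partition count, and sample size are hypothetical.

import scala.util.Random
import org.apache.spark.sql.SparkSession

object PerPartitionRng {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("per-partition-rng").getOrCreate()
    val sc = spark.sparkContext

    val baseSeed = 42L // hypothetical base seed

    // Each partition gets its own RNG, seeded from the base seed and the
    // partition index, so no round trips to the driver are needed.
    val samples = sc.parallelize(0 until 1000000, numSlices = 8)
      .mapPartitionsWithIndex { (partitionIndex, iter) =>
        val rng = new Random(baseSeed + partitionIndex)
        iter.map(_ => rng.nextDouble())
      }

    println(samples.take(5).mkString(", "))
    spark.stop()
  }
}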
ted correctly, if you're joining then overwrite otherwise only
> append as it removes dups.
>
> I think, in this scenario, just change it to write.mode('overwrite') because
> you're already reading the old data and your job would be done.
>
>
> On Sat 2 Ju
> Benjamin,
>
> The append will append the "new" data to the existing data without removing
> the duplicates. You would need to overwrite the file every time if you need
> unique values.
>
> Thanks,
> Jayadeep
>
> On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote:
I have a situation where I'm trying to add only new rows to an existing data set
that lives in S3 as gzipped parquet files, looping and appending for each hour
of the day. First, I create a DF from the existing data, then I use a query to
create another DF with the data that is new. Here is the co
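Following the suggestions above, here is a minimal Scala sketch of one way to keep only the genuinely new rows before appending; the S3 paths and the "id" key column are assumptions for illustration.

import org.apache.spark.sql.SparkSession

object AppendOnlyNewRows {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("append-only-new-rows").getOrCreate()

    val existingPath = "s3a://bucket/existing/" // hypothetical S3 path
    val existing = spark.read.parquet(existingPath)
    val incoming = spark.read.parquet("s3a://bucket/incoming/hour=01/") // hypothetical

    // Keep only incoming rows whose key is not already present, then append them.
    val newRows = incoming.join(existing, Seq("id"), "left_anti") // "id" is an assumed key column
    newRows.write.mode("append").parquet(existingPath)

    spark.stop()
  }
}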
To add, we have a CDH 5.12 cluster with Spark 2.2 in our data center.
On Mon, Nov 13, 2017 at 3:15 PM Benjamin Kim wrote:
> Does anyone know if there is a connector for AWS Kinesis that can be used
> as a source for Structured Streaming?
>
> Thanks.
>
>
I have a question about this. The documentation compares the concept
to BigQuery. Does this mean that we will no longer need to deal
with instances and just pay for execution duration and amount of data
processed? I’m just curious about how this will be priced.
Also, when will it be ready
Does anyone know if there is a connector for AWS Kinesis that can be used
as a source for Structured Streaming?
Thanks.
With AWS having Glue and GCE having Dataprep, is Databricks coming out with
an equivalent or better? I know that Serverless is a new offering, but will
it go farther with automatic data schema discovery, profiling, metadata
storage, change triggering, joining, transform suggestions, etc.?
Just cur
From: john_test_test
Sent: Wednesday, August 9, 2017 3:09:44 AM
To: user@spark.apache.org
Subject: speculative execution in spark
Is it possible somehow to take advantage of the already processed portion
of the failed task, so I can use the speculative executio
Has anyone seen AWS Glue? I was wondering if there is something similar going
to be built into Spark Structured Streaming? I like the Data Catalog idea to
store and track any data source/destination. It profiles the data to derive the
schema and data types. Also, it does some sort-of automated s
Hi Bo,
+1 for your project. I come from the world of data warehouses, ETL, and
reporting analytics. There are many individuals who do not know or want to do
any coding. They are content with ANSI SQL and stick to it. ETL workflows are
also done without any coding using a drag-and-drop user inte
I’m curious about if and when Spark SQL will ever remove its dependency on Hive
Metastore. Now that Spark 2.1’s SparkSession has superseded the need for
HiveContext, are there plans for Spark to no longer use the Hive Metastore
service, replacing it with a “SparkSchema” service backed by PostgreSQL, MySQL, etc.
I am hoping someone can confirm this is a bug and/or provide a solution. I
am trying to serialize an LDA model to disk for later use, but upon
deserialization the model is not fully functional. In particular,
transformation of data throws a NullPointerException. Here is a minimal
example (just run
Gourav,
I’ll start experimenting with Spark 2.1 to see if this works.
Cheers,
Ben
> On Feb 24, 2017, at 5:46 AM, Gourav Sengupta
> wrote:
>
> Hi Benjamin,
>
> First of all, fetching data from S3 while writing code on an on-premise system
> is a very bad idea. You mig
o Spark 2.0/2.1.
>
> And besides, would you not want to work on a platform which is at least
> 10 times faster? What would that be?
>
> Regards,
> Gourav Sengupta
>
> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim wrote:
> We are t
can be
> hidden and read from Input Params.
>
> Thanks,
> Aakash.
>
>
> On 23-Feb-2017 11:54 PM, "Benjamin Kim" wrote:
> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet
> file from AWS S
We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet
file from AWS S3. We can read the schema and show some data when the file is
loaded into a DataFrame, but when we try to do some operations, such as count,
we get this error below.
com.cloudera.com.amazonaws.AmazonClien
ur vendor should use the parquet internal compression and not take a
> parquet file and gzip it.
>
>> On 13 Feb 2017, at 18:48, Benjamin Kim wrote:
>>
>> We are receiving files from an outside vendor who creates a Parquet data
>> file and Gzips it before delivery.
We are receiving files from an outside vendor who creates a Parquet data file
and Gzips it before delivery. Does anyone know how to Gunzip the file in Spark
and inject the Parquet data into a DataFrame? I thought using sc.textFile or
sc.wholeTextFiles would automatically Gunzip the file, but I’m
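As the reply above notes, the fix on the producing side is Parquet's built-in codec rather than gzipping the file afterwards; a minimal Scala sketch of writing with internal gzip compression, with hypothetical paths:

import org.apache.spark.sql.SparkSession

object WriteCompressedParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("write-compressed-parquet").getOrCreate()

    val df = spark.read.parquet("/input/raw") // hypothetical source

    // Parquet compresses page data internally, so the output stays splittable
    // and readable by spark.read.parquet without any manual gunzip step.
    df.write
      .option("compression", "gzip")
      .parquet("/output/compressed")

    spark.stop()
  }
}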
Has anyone got some advice on how to remove the reliance on HDFS for storing
persistent data. We have an on-premise Spark cluster. It seems like a waste of
resources to keep adding nodes because of a lack of storage space only. I would
rather add more powerful nodes due to the lack of processing
ote:
>
> Did you check the actual maven dep tree? Something might be pulling in a
> different version. Also, if you're seeing this locally, you might want to
> check which version of the scala sdk your IDE is using
>
> Asher Krim
> Senior Software Engineer
>
>
> On
did you see only scala 2.10.5 being pulled in?
>
> On Fri, Feb 3, 2017 at 12:33 PM, Benjamin Kim wrote:
> Asher,
>
> It’s still the same. Do you have any other ideas?
>
> Cheers,
> Ben
>
>
>> On Feb 3, 2017, at 8:16 AM, A
to
> check which version of the scala sdk your IDE is using
>
> Asher Krim
> Senior Software Engineer
>
>
> On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim wrote:
> Hi Asher,
>
> I modified the pom to be the same Spark (1.6.0), HBas
you're seeing this locally, you might want to
> check which version of the scala sdk your IDE is using
>
> Asher Krim
> Senior Software Engineer
>
> On Thu, Feb 2, 2017 at 5:43 PM, Benjamin Kim wrote:
>
> Hi Asher,
>
> I modified the pom to be the same Spark (1.6.0),
her Krim wrote:
>
> Ben,
>
> That looks like a scala version mismatch. Have you checked your dep tree?
>
> Asher Krim
> Senior Software Engineer
>
>
> On Thu, Feb 2, 2017 at 1:28 PM, Benjamin Kim wrote:
> Elek,
>
>
ltSource.createRelation(HBaseRelation.scala:51)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
If you can please help, I would be grateful.
Cheers,
Ben
Elek,
If I cannot use the HBase Spark module, then I’ll give it a try.
Thanks,
Ben
> On Jan 31, 2017, at 1:02 PM, Marton, Elek wrote:
>
>
> I tested this one with hbase 1.2.4:
>
> https://github.com/hortonworks-spark/shc
>
> Marton
>
> On 01/31/2017 09:17 P
Does anyone know how to backport the HBase Spark module to HBase 1.2.0? I tried
to build it from source, but I cannot get it to work.
Thanks,
Ben
Thanks, Hyukjin.
I’ll try using the Parquet tools for 1.9
On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon wrote:
Hi Benjamin,
As you might already know, I believe the Hadoop command automatically does
not merge the column-based format such as ORC or Parquet but just simply
concatenates them.
I
Thanks, Hyukjin.
I’ll try using the Parquet tools for 1.9 based on the jira. If that doesn’t
work, I’ll try Kite.
Cheers,
Ben
> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon wrote:
>
> Hi Benjamin,
>
>
> As you might already know, I believe the Hadoop command automatically
Has anyone tried to merge *.gz.parquet files before? I'm trying to merge them
into 1 file after they are output from Spark. Doing a coalesce(1) on the Spark
cluster will not work. It just does not have the resources to do it. I'm trying
to do it using the commandline and not use Spark. I will us
eed. But as it states, deeper integration with Scala is yet to be
> developed.
> Any thoughts on how to use TensorFlow with Scala? Need to write wrappers, I
> think.
>
>
> On Oct 19, 2016 7:56 AM, "Benjamin Kim" wrote:
> On
Has anyone worked with AWS Kinesis and retrieved data from it using Spark
Streaming? I am having issues where it’s returning no data. I can connect to
the Kinesis stream and describe using Spark. Is there something I’m missing?
Are there specific IAM security settings needed? I just simply follo
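For comparison, here is a minimal Scala sketch of a Kinesis receiver using the spark-streaming-kinesis-asl module; the application name, stream name, and region are hypothetical, and one common gotcha is that the KCL also needs DynamoDB access for its checkpoint table.

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KinesisIngest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kinesis-ingest")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The KCL checkpoints into a DynamoDB table named after the application
    // name below, so the IAM role needs DynamoDB permissions as well.
    val stream = KinesisUtils.createStream(
      ssc,
      "kinesis-ingest-app",                      // hypothetical KCL application name
      "my-stream",                               // hypothetical Kinesis stream name
      "https://kinesis.us-east-1.amazonaws.com",
      "us-east-1",
      InitialPositionInStream.TRIM_HORIZON,      // LATEST only sees records produced after startup
      Seconds(10),
      StorageLevel.MEMORY_AND_DISK_2)

    stream.map(bytes => new String(bytes, "UTF-8")).print()

    ssc.start()
    ssc.awaitTermination()
  }
}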
On that note, here is an article that Databricks made regarding using
Tensorflow in conjunction with Spark.
https://databricks.com/blog/2016/01/25/deep-learning-with-apache-spark-and-tensorflow.html
Cheers,
Ben
> On Oct 19, 2016, at 3:09 AM, Gourav Sengupta
> wrote:
>
> while using Deep Lea
> table cache and expose it through the thriftserver. But you have to implement
> the loading logic, it can be very simple to very complex depending on your
> needs.
>
>
> 2016-10-17 19:48 GMT+02:00 Benjamin Kim:
> Is this techniq
terface in to the big data world
> revolves around the JDBC/ODBC interface. So if you don’t have that piece as
> part of your solution, you’re DOA w respect to Tableau.
>
> Have you considered Drill as your JDBC connection point? (YAAP: Yet another
> Apache project)
>
Is there only one process adding rows? because this seems a little risky if
> you have multiple threads doing that…
>
>> On Oct 8, 2016, at 1:43 PM, Benjamin Kim wrote:
>>
>> Mich,
>>
>> After much searching, I
ll provide an in-memory cache for interactive analytics. You
> can put full tables in-memory with Hive using Ignite HDFS in-memory solution.
> All this does only make sense if you do not use MR as an engine, the right
> input format (ORC, parquet) and a recent Hive version.
>
>
aming specifics, there are at least 4 or 5 different implementations
> of HBASE sources, each at varying level of development and different
> requirements (HBASE release version, Kerberos support etc)
>
>
> _
> From: Benjamin Kim
experience with this!
>
>
> _
> From: Benjamin Kim
> Sent: Saturday, October 8, 2016 11:00 AM
> Subject: Re: Spark SQL Thriftserver with HBase
> To: Felix Cheung
> Cc: m
Thrift Server (with USING,
> http://spark.apache.org/docs/latest/sql-programming-guide.html#tab_sql_10).
>
>
> _
> From: Benjamin Kim
book.html#spark>
>
> And if you search you should find several alternative approaches.
>
>
>
>
>
> On Fri, Oct 7, 2016 at 7:56 AM -0700, "Benjamin Kim" wrote:
>
> Does anyone know if Spark can work with HBase tab
I have a table with data already in it that has primary keys generated by the
function monotonicallyIncreasingId. Now, I want to insert more data into it
with primary keys that will auto-increment from where the existing data left
off. How would I do this? There is no argument I can pass into th
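One common workaround, sketched below in Scala, is to read the current maximum key and shift freshly generated ids past it. The table name, path, and "id" column are assumptions, and monotonically_increasing_id() gives unique but non-contiguous values.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, max, monotonically_increasing_id}

object ContinueIds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("continue-ids").getOrCreate()

    val existing = spark.table("existing_table")        // hypothetical table name
    val incoming = spark.read.parquet("/new/data/path") // hypothetical path

    // Find the largest key already in use (assumed to be a long column named "id"),
    // then shift the generated ids past it so new keys never collide with old ones.
    val maxId = existing.agg(max("id")).head.getLong(0)
    val withIds = incoming.withColumn("id", monotonically_increasing_id() + lit(maxId + 1))

    withIds.write.mode("append").saveAsTable("existing_table")
    spark.stop()
  }
}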
Does anyone know if Spark can work with HBase tables using Spark SQL? I know in
Hive we are able to create tables on top of an underlying HBase table that can
be accessed using MapReduce jobs. Can the same be done using HiveContext or
SQLContext? We are trying to setup a way to GET and POST data
On Oct 6, 2016, at 4:27 PM, Benjamin Kim wrote:
>>
>> Has anyone tried to integrate Spark with a server farm of RESTful API
>> endpoints or even HTTP web-servers for that matter? I know it’s typically
>> done using a web farm as the presentation interface, then data flows thro
Has anyone tried to integrate Spark with a server farm of RESTful API endpoints
or even HTTP web-servers for that matter? I know it’s typically done using a
web farm as the presentation interface, then data flows through a
firewall/router to direct calls to a JDBC listener that will SELECT, INSE
I got this email a while back in regards to this.
Dear Spark users and developers,
I have released version 1.0.0 of scalable-deeplearning package. This package is
based on the implementation of artificial neural networks in Spark ML. It is
intended for new Spark deep learning features that wer
> That sounds interesting, would love to learn more about it.
>
> Mitch: looks good. Lastly I would suggest you to think if you really need
> multiple column families.
>
> On 4 Oct 2016 02:57, "Benjamin Kim" wrote:
> Lately, I
ROW                COLUMN+CELL
> Tesco PLC        column=stock_daily:close, timestamp=1475447365118, value=325.25
> Tesco PLC        column=stock_daily:high, timestamp=1475447365118, value=332.00
> Tesc
> On 1 October 2016 at 23:39, Benjamin Kim
Mich,
I know up until CDH 5.4 we had to add the HTrace jar to the classpath to make
it work using the command below. But after upgrading to CDH 5.7, it became
unnecessary.
echo "/opt/cloudera/parcels/CDH/jars/htrace-core-3.2.0-incubating.jar" >>
/etc/spark/conf/classpath.txt
Hope this helps.
Thanks,
Ben
> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet wrote:
>
> Hi! Can you split the init code from the current command? I think that is the
> main problem in your code.
>
> On 16 Sep 2016 at 8:26 PM, "Benjamin Kim" wrote:
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data
from PostgreSQL using JDBC? I can get to the table and see the schema, but when
I do a show, it takes very long or keeps timing out.
The code is simple.
val jdbcDF = sqlContext.read.format("jdbc").options(
Map("u
> tables which "point to" any other DB. I know Oracle provides their own SerDe
> for Hive. Not sure about PG though.
>
> Once tables are created in Hive, STS will automatically see them.
>
> On Wed, Sep 14, 2016 at 11:08 AM, Benjamin Kim
Has anyone created tables using Spark SQL that directly connect to a JDBC data
source such as PostgreSQL? I would like to use Spark SQL Thriftserver to access
and query remote PostgreSQL tables. In this way, we can centralize data access
to Spark SQL tables along with PostgreSQL making it very c
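A minimal Scala sketch of what this can look like, assuming Spark 2.x syntax (Spark 1.6 used CREATE TEMPORARY TABLE ... USING in the same way); the connection details are hypothetical, and the same DDL can be issued through beeline against the Thriftserver.

import org.apache.spark.sql.SparkSession

object JdbcBackedTable {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("jdbc-backed-table")
      .enableHiveSupport() // so the definition lands in the metastore the Thriftserver reads
      .getOrCreate()

    // Register a table whose rows are fetched from PostgreSQL on demand.
    spark.sql(
      """
        |CREATE TABLE pg_my_table
        |USING org.apache.spark.sql.jdbc
        |OPTIONS (
        |  url 'jdbc:postgresql://dbhost:5432/mydb',
        |  dbtable 'public.my_table',
        |  user 'dbuser',
        |  password 'secret'
        |)
      """.stripMargin)

    spark.sql("SELECT count(*) FROM pg_my_table").show()
    spark.stop()
  }
}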
Does anyone have any thoughts about using Spark SQL Thriftserver in Spark 1.6.2
instead of HiveServer2? We are considering abandoning HiveServer2 for it. Some
advice and gotchas would be nice to know.
Thanks,
Ben
We use Graphite/Grafana for custom metrics. We found Spark’s metrics not to be
customizable. So, we write directly using Graphite’s API, which was very easy
to do using Java’s socket library in Scala. It works great for us, and we are
going one step further using Sensu to alert us if there is an
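A minimal Scala sketch of that approach, pushing one metric over Graphite's plaintext protocol with plain java.net sockets; the host and metric name are hypothetical, and 2003 is Graphite's default plaintext port.

import java.io.PrintWriter
import java.net.Socket

object GraphiteSink {
  // Send one metric using Graphite's plaintext protocol: "<path> <value> <epoch-seconds>\n".
  def send(host: String, port: Int, path: String, value: Double): Unit = {
    val socket = new Socket(host, port)
    try {
      val out = new PrintWriter(socket.getOutputStream, true)
      val timestamp = System.currentTimeMillis() / 1000
      out.println(s"$path $value $timestamp")
    } finally {
      socket.close()
    }
  }

  def main(args: Array[String]): Unit = {
    send("graphite.example.com", 2003, "spark.jobs.records_processed", 12345.0)
  }
}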
> On 3 September 2016 at 20:31, Benjamin Kim
I was wondering if anyone has tried to create Spark SQL tables on top of HBase
tables so that data in HBase can be accessed using Spark Thriftserver with SQL
statements? This is similar what can be done using Hive.
Thanks,
Ben
I am trying to implement checkpointing in my streaming application but I am
getting a not serializable error. Has anyone encountered this? I am deploying
this job in YARN clustered mode.
Here is a snippet of the main parts of the code.
object S3EventIngestion {
//create and setup streaming
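For reference, a minimal Scala sketch of the usual checkpointing pattern with StreamingContext.getOrCreate; the checkpoint directory and source path are hypothetical. Keeping all DStream setup inside the factory function is also the usual way to avoid NotSerializableException, since nothing from the enclosing scope gets captured.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedStreaming {
  val checkpointDir = "hdfs:///checkpoints/s3-event-ingestion" // hypothetical path

  // All DStream setup has to happen inside this factory so it can be replayed
  // from the checkpoint; referencing outer non-serializable objects here is a
  // common cause of NotSerializableException.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("s3-event-ingestion")
    val ssc = new StreamingContext(conf, Seconds(30))
    ssc.checkpoint(checkpointDir)

    val lines = ssc.textFileStream("s3a://bucket/events/") // hypothetical source
    lines.count().print()

    ssc
  }

  def main(args: Array[String]): Unit = {
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}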
I would like to know if anyone has tried using the hbase-spark module? I tried
to follow the examples in conjunction with CDH 5.8.0. I cannot find the
HBaseTableCatalog class in the module or in any of the Spark jars. Can someone
help?
Thanks,
Ben
It is included in Cloudera’s CDH 5.8.
> On Jul 22, 2016, at 6:13 PM, Mail.com wrote:
>
> Hbase Spark module will be available with Hbase 2.0. Is that out yet?
>
>> On Jul 22, 2016, at 8:50 PM, Def_Os wrote:
>>
>> So it appears it should be possible to use HBase's new hbase-spark module, if
>>
From what I read, there is no more Contexts.
"SparkContext, SQLContext, HiveContext merged into SparkSession"
I have not tested it, so I don’t know if it’s true.
Cheers,
Ben
> On Jul 18, 2016, at 8:37 AM, Koert Kuipers wrote:
>
> in my codebase i would like to gradually transition t
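A minimal Scala sketch of the Spark 2.x entry point, for anyone planning the same gradual transition; the old contexts remain reachable from the session.

import org.apache.spark.sql.SparkSession

object SessionEntryPoint {
  def main(args: Array[String]): Unit = {
    // Spark 2.x entry point; enableHiveSupport() takes over the HiveContext role.
    val spark = SparkSession.builder
      .appName("session-entry-point")
      .enableHiveSupport()
      .getOrCreate()

    // The old entry points are still reachable for a gradual transition.
    val sc = spark.sparkContext
    val sqlContext = spark.sqlContext

    println(sc.version)
    spark.stop()
  }
}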
It takes me to the directories instead of the webpage.
> On Jul 13, 2016, at 11:45 AM, manish ranjan wrote:
>
> working for me. What do you mean 'as supposed to'?
>
> ~Manish
>
>
>
> On Wed, Jul 13, 2016 at 11:45 AM, Benjamin Kim
Has anyone noticed that spark.apache.org is not working as it's supposed to?
requencyCol 'retweets', timeSeriesColumn
> 'tweetTime' )"
> where 'tweetStreamTable' is created using the 'create stream table ...' SQL
> syntax.
>
>
> -
> Jags
> SnappyData blog <http://www.snappydata.io/blog>
> D
> Jags
> SnappyData blog <http://www.snappydata.io/blog>
> Download binary, source <https://github.com/SnappyDataInc/snappydata>
>
>
> On Wed, Jul 6, 2016 at 12:49 AM, Benjamin Kim wrote:
> I recently got a sales email from Sna
I recently got a sales email from SnappyData, and after reading the
documentation about what they offer, it sounds very similar to what Structured
Streaming will offer w/o the underlying in-memory, spill-to-disk, CRUD
compliant data storage in SnappyData. I was wondering if Structured Streaming
I was wondering if anyone, who is a Spark Scala developer, would be willing to
continue the work done for the Kudu connector?
https://github.com/apache/incubator-kudu/tree/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu
I have been testing and using Kudu for the past month and compar
Has anyone implemented a way to track the performance of a data model? We
currently have an algorithm to do record linkage and spit out statistics of
matches, non-matches, and/or partial matches with reason codes of why we didn’t
match accurately. In this way, we will know if something goes wron
Has anyone run into this requirement?
We have a need to track data integrity and model quality metrics of outcomes so
that we can both gauge if the data is healthy coming in and the models run
against them are still performing and not giving faulty results. A nice to have
would be to graph thes
browser_major_version string
> browser_minor_version string
> os_family string
> os_name string
> os_version string
> os_major_version string
> os_minor_version string
> # Partition Information
> # col_name
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
`os_name` string COMMENT '',
`os_version` string COMMENT '',
`os_major_version` string COMMENT '',
Does anyone know how to save data in a DataFrame to a table partitioned using
an existing column reformatted into a derived column?
val partitionedDf = df.withColumn("dt",
concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", 12,
2), lit(":00")))
Ben
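A minimal Scala sketch of one way to finish this, writing the derived "dt" column out as the partition key with partitionBy; the input and output paths are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{concat, lit, substring}

object WriteDerivedPartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("write-derived-partition").getOrCreate()
    import spark.implicits._

    val df = spark.read.parquet("/input/events") // hypothetical source with a "timestamp" column

    // Derive an hourly bucket from the timestamp string, as in the snippet above.
    val partitionedDf = df.withColumn("dt",
      concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", 12, 2), lit(":00")))

    // partitionBy lays the data out as dt=<value>/ directories under the target path.
    partitionedDf.write
      .mode("append")
      .partitionBy("dt")
      .parquet("/output/events_by_hour") // hypothetical target

    spark.stop()
  }
}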
> On May 21, 2016, at 4:18 AM, Ted Yu wrote:
>
> Maybe more than one version of jets3t-xx.jar was on the classpath.
>
> FYI
>
> On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim wrote:
> I am trying to stream files from an S3 buck
could be wrong.
Thanks,
Ben
> On May 21, 2016, at 4:18 AM, Ted Yu wrote:
>
> Maybe more than one version of jets3t-xx.jar was on the classpath.
>
> FYI
>
> On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim wrote:
> I am trying to stream
I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of
Spark 1.6.0. It seems not to work. I keep getting this error.
Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand
stack
Exception Details:
Location:
org/apache/hadoop/fs/s3native/Jets3tNat
I have a curiosity question. These forever/unlimited DataFrames/DataSets will
persist and be query capable. I still am foggy about how this data will be
stored. As far as I know, memory is finite. Will the data be spilled to disk
and be retrievable if the query spans data not in memory? Is Tachy
> On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim wrote:
> Hi Ofir,
>
> I just recently saw the webinar with Reynold Xin. He mentioned the Spark
Hi Ofir,
I just recently saw the webinar with Reynold Xin. He mentioned the Spark
Session unification efforts, but I don’t remember the DataSet for Structured
Streaming aka Continuous Applications as he put it. He did mention streaming or
unlimited DataFrames for Structured Streaming so one can
> Cheers
>
> On Apr 27, 2016, at 10:31 PM, Benjamin Kim wrote:
>
>> Hi Ted,
>>
>> Do you know when the release will be? I also see some documentation for
>> usage of the hbase-spark module at the hbase website. But, I d
Next Thursday is Databricks' webinar on Spark 2.0. If you are attending, I bet
many are going to ask when the release will be. Last time they did this, Spark
1.6 came out not too long afterward.
> On Apr 28, 2016, at 5:21 AM, Sean Owen wrote:
>
> I don't know if anyone has begun a firm discuss
Can someone explain to me how the new Structured Streaming works in the
upcoming Spark 2.0+? I’m a little hazy how data will be stored and referenced
if it can be queried and/or batch processed directly from streams and if the
data will be append only to or will there be some sort of upsert capa
?
Thanks,
Ben
> On Apr 21, 2016, at 6:56 AM, Ted Yu wrote:
>
> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do
> this.
>
> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim wrote:
> Has anyone found an easy way
at 11:46 AM, Paras sachdeva
> wrote:
>
> Hi Daniel,
>
> Would you possibly be able to share the snippet of code you have used?
>
> Thank you.
>
> On Wed, Apr 27, 2016 at 3:13 PM, Daniel Haviv wrote:
> H
I have data in a DataFrame loaded from a CSV file. I need to load this data
into HBase using an RDD formatted in a certain way.
val rdd = sc.parallelize(
  Array(key1,
    (ColumnFamily, ColumnName1, Value1),
    (ColumnFamily, ColumnName2, Value2),
    (
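A minimal Scala sketch of reshaping DataFrame rows into that (rowKey, Seq((family, qualifier, value))) layout; the CSV path, key column, and column names are assumptions for illustration.

import org.apache.spark.sql.SparkSession

object DataFrameToHBaseShape {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("df-to-hbase-shape").getOrCreate()

    // Hypothetical CSV with a key column plus two value columns.
    val df = spark.read.option("header", "true").csv("/input/data.csv")

    val columnFamily = "cf"
    // Reshape each row into (rowKey, Seq((family, qualifier, value))), the layout
    // sketched in the snippet above, ready to feed a bulk-load routine.
    val rdd = df.rdd.map { row =>
      val key = row.getAs[String]("id")                    // assumed key column
      val cells = Seq(
        (columnFamily, "col1", row.getAs[String]("col1")), // assumed value columns
        (columnFamily, "col2", row.getAs[String]("col2")))
      (key, cells)
    }

    println(rdd.take(1).mkString)
    spark.stop()
  }
}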