Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim
…Gelhausen <rgel...@gmail.com> wrote: > I don't- I hoped providing that information may help in finding & fixing the problem. > On Thu, Jul 7, 2016 at 5:53 PM, Benjamin Kim <bbuil...@gmail.com> wrote…

Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim
Hi Randy, Do you know of any way to fix it or know of a workaround? Thanks, Ben > On Jul 7, 2016, at 2:08 PM, Randy Gelhausen wrote: > HTTP 500 errors from a Helium URL…

Re: Shiro LDAP w/ Search Bind Authentication

2016-07-06 Thread Benjamin Kim
…ldaps calls without issue. We're then using group memberships to define roles and control access to notebooks. > Hope that helps. > Rob > On Wed, Jul 6, 2016 at 2:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote…

Shiro LDAP w/ Search Bind Authentication

2016-07-06 Thread Benjamin Kim
I have been trying to find documentation on how to enable LDAP authentication, but I cannot find how to enter the values for these configurations. This is necessary because our LDAP server is secured. Here are the properties that I need to set: ldap_cert, use_start_tls, bind_dn, bind_password. Can…
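For reference, a minimal sketch of what a search-bind setup might look like in Zeppelin's conf/shiro.ini, assuming Shiro's stock JndiLdapRealm; the URL, user DN template, bind DN, and password below are all placeholders, and the Hue-style property names above do not map one-to-one:

    [main]
    # Stock Shiro LDAP realm; the bind credentials live on the context factory.
    ldapRealm = org.apache.shiro.realm.ldap.JndiLdapRealm
    ldapRealm.userDnTemplate = uid={0},ou=people,dc=example,dc=com
    # An ldaps:// URL covers the secured connection requirement.
    ldapRealm.contextFactory.url = ldaps://ldap.example.com:636
    ldapRealm.contextFactory.systemUsername = uid=binduser,ou=people,dc=example,dc=com
    ldapRealm.contextFactory.systemPassword = bindpassword
    securityManager.realms = $ldapRealm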

Re: SnappyData and Structured Streaming

2016-07-06 Thread Benjamin Kim
"options(key 'hashtag', frequencyCol 'retweets', timeSeriesColumn > 'tweetTime' )" > where 'tweetStreamTable' is created using the 'create stream table ...' SQL > syntax. > > > - > Jags > SnappyData blog <http://www.snappydata.io/blog> > Download binary, s

Re: SnappyData and Structured Streaming

2016-07-06 Thread Benjamin Kim
…re). > - Jags > SnappyData blog <http://www.snappydata.io/blog> > Download binary, source <https://github.com/SnappyDataInc/snappydata> > On Wed, Jul 6, 2016 at 12:49 AM, Benjamin Kim <bbuil...@gmail.com> wrote…

Re: Performance Question

2016-07-06 Thread Benjamin Kim
…data frame or SQL, etc? Maybe you can share the schema and some queries > Todd > On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote: > Hi Todd, > Now that Kudu 0.9.0 is out…

SnappyData and Structured Streaming

2016-07-05 Thread Benjamin Kim
I recently got a sales email from SnappyData, and after reading the documentation about what they offer, it sounds very similar to what Structured Streaming will offer w/o the underlying in-memory, spill-to-disk, CRUD-compliant data storage in SnappyData. I was wondering if Structured Streaming…

Re: spark interpreter

2016-07-01 Thread Benjamin Kim
…included in 0.6.0. If it is not listed when you create an interpreter setting, could you check whether your 'zeppelin.interpreters' property lists the Livy interpreter classes? (conf/zeppelin-site.xml) > Thanks, > moon > On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim <bbuil...@gmail.com>…

Re: Performance Question

2016-06-30 Thread Benjamin Kim
…or SQL, etc? Maybe you can share the schema and some queries > Todd > On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote: > Hi Todd, > Now that Kudu 0.9.0 is out. I h…

Re: spark interpreter

2016-06-30 Thread Benjamin Kim
…> wrote: > Hi Ben, > Livy interpreter is included in 0.6.0. If it is not listed when you create an interpreter setting, could you check whether your 'zeppelin.interpreters' property lists the Livy interpreter classes? (conf/zeppelin-site.xml) > Thanks, > moon > On…
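A hedged sketch of the check moon describes, assuming the 0.6.0 Livy interpreter class names; the value below is abbreviated, and the exact default list in conf/zeppelin-site.xml may differ:

    <!-- conf/zeppelin-site.xml: the Livy classes must appear in this list -->
    <property>
      <name>zeppelin.interpreters</name>
      <value>org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.livy.LivySparkInterpreter,org.apache.zeppelin.livy.LivySparkSQLInterpreter,...</value>
    </property>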

Re: Performance Question

2016-06-29 Thread Benjamin Kim
…Todd Lipcon <t...@cloudera.com> wrote: > On Wed, Jun 29, 2016 at 11:32 AM, Benjamin Kim <bbuil...@gmail.com> wrote: > Todd, > I started Spark streaming more events into Kudu. Performance is great there too! With HBase…

Re: spark interpreter

2016-06-29 Thread Benjamin Kim
On a side note… has anyone managed to add the Livy interpreter in the latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. Could this interfere? Thanks, Ben > On Jun 29, 2016, at 11:18 AM, moon soo Lee wrote: > Livy interpreter…

Re: Performance Question

2016-06-29 Thread Benjamin Kim
…seconds you're seeing is constant overhead from Spark job setup, etc., given that the performance doesn't seem to get slower as you went from 700K rows to 13M rows. > -Todd > On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com>…

Kudu Connector

2016-06-29 Thread Benjamin Kim
I was wondering if any Spark Scala developer would be willing to continue the work done for the Kudu connector: https://github.com/apache/incubator-kudu/tree/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu I have been testing and using Kudu for the past month and…
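For context, a minimal sketch of reading a Kudu table with this connector, assuming the README-style options the kudu-spark package exposes; the master address and table name are placeholders:

    import org.kududb.spark.kudu._

    // Read a Kudu table into a DataFrame via the connector linked above;
    // the import brings the .kudu reader shortcut into scope.
    val df = sqlContext.read
      .options(Map("kudu.master" -> "kudu-master.example.com:7051",
                   "kudu.table"  -> "my_table"))
      .kudu
    df.registerTempTable("my_table")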

Re: Performance Question

2016-06-28 Thread Benjamin Kim
…guide <http://getkudu.io/docs/schema_design.html#data-distribution>. We generally recommend sticking to hash partitioning if possible, since you don't have to determine your own split rows. > - Dan > On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com>…

Re: Spark 1.6 (CDH 5.7) and Phoenix 4.7 (CLABS)

2016-06-27 Thread Benjamin Kim
…> For problems with the Cloudera Labs packaging of Apache Phoenix, you should first seek help on the vendor-specific community forums, to ensure the issue isn't specific to the vendor: > http://community.cloudera.com/t5/Cloudera-Labs/bd-p/ClouderaLabs > -busbey…

Spark 1.6 (CDH 5.7) and Phoenix 4.7 (CLABS)

2016-06-27 Thread Benjamin Kim
Has anyone tried to save a DataFrame to an HBase table using Phoenix? I am able to load and read, but I can’t save. >> spark-shell --jars …

livy interpreter not appearing

2016-06-26 Thread Benjamin Kim
Has anyone tried using the Livy interpreter? I cannot add it. It just does not appear after clicking Save. Thanks, Ben

Re: phoenix on non-apache hbase

2016-06-25 Thread Benjamin Kim
>> is it reasonable to expect a "stock" phoenix client to work against a custom phoenix server for cdh 5.x? (with of course the phoenix client and server having the same phoenix version). >> On Thu, Jun 9, 2016 at 10:55 PM, Benjamin Kim…

Model Quality Tracking

2016-06-24 Thread Benjamin Kim
Has anyone implemented a way to track the performance of a data model? We currently have an algorithm to do record linkage and spit out statistics of matches, non-matches, and/or partial matches, with reason codes for why we didn’t match accurately. In this way, we will know if something goes…

Re: Spark on Kudu

2016-06-20 Thread Benjamin Kim
…upsert. These modes come from Spark, and they were really designed for file-backed storage and not table storage. We may want to do append = upsert, and overwrite = truncate + insert. I think that may match the normal Spark semantics more closely. >> - Dan

Data Integrity / Model Quality Monitoring

2016-06-17 Thread Benjamin Kim
Has anyone run into this requirement? We have a need to track data integrity and model quality metrics of outcomes, so that we can gauge both whether the data coming in is healthy and whether the models run against it are still performing and not giving faulty results. A nice-to-have would be to graph…

Re: Spark on Kudu

2016-06-17 Thread Benjamin Kim
…the semantics will be, but at least one of them will be upsert. These modes come from Spark, and they were really designed for file-backed storage and not table storage. We may want to do append = upsert, and overwrite = truncate + insert. I think that may match…

Re: Spark on Kudu

2016-06-17 Thread Benjamin Kim
…overwrite = truncate + insert. I think that may match the normal Spark semantics more closely. > - Dan > On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Dan, > Thanks for the information. That w…

Re: Ask opinion regarding 0.6.0 release package

2016-06-17 Thread Benjamin Kim
Hi, our company uses the spark, phoenix, and jdbc/psql interpreters. So, if you make different packages, I would need the full one. In addition, for the minimized one, would there be a way to pick and choose interpreters to add/plug in? Thanks, Ben > On Jun 17, 2016, at 1:02 AM, mina lee…

Re: Spark on Kudu

2016-06-15 Thread Benjamin Kim
…and they were really designed for file-backed storage and not table storage. We may want to do append = upsert, and overwrite = truncate + insert. I think that may match the normal Spark semantics more closely. > - Dan > On Tue, Jun 14, 2016 at 6:00 PM, Benjam…

Re: Performance Question

2016-06-15 Thread Benjamin Kim
…want to try a table with replication count 1 > On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote: > Hi Todd, > I did a simple test of our ad events. We stream using Spark Streaming directly into HBase,…

Re: Performance Question

2016-06-15 Thread Benjamin Kim
…The part that scares most users is joining this data with other dimension/3rd-party events tables, because of the sheer size of it. We do what most companies do, similar to what I saw in earlier presentations of Kudu: we dump data out of HBase into partitioned Parquet tables to make qu…

Re: Performance Question

2016-06-15 Thread Benjamin Kim
…really do some conclusive tests? I want to see if I can match your results on my 50-node cluster. Thanks, Ben > On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote: > On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com>…

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim
…what would happen if I “overwrite” existing data when the DataFrame has data in it that does not exist in the Kudu table? I need to evaluate the best way to simulate the UPSERT behavior we have in HBase, because this is what our use case requires. Thanks, Ben > On Jun 14, 2016, at 5:05 PM, Benjamin Kim…

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim
…Not found: key not found (error 0) (repeated 4×) Does the key field need to be first in the DataFrame? Thanks, Ben > On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:…

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim
…Could you try: > import org.kududb.client._ > and try again? > - Dan > On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > I encountered an error trying to create a table based on the documentation fr…

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-14 Thread Benjamin Kim
Hi J-D, I would like to get started, especially now that UPSERT and Spark SQL DataFrames are supported. But how do I use Cloudera Manager to deploy it? Is there a parcel available yet? Is there a new CSD file to download? I currently have CM 5.7.0 installed. Thanks, Ben > On Jun 10,…

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-13 Thread Benjamin Kim
Hi J-D, I would like to get started, especially now that UPSERT and Spark SQL DataFrames are supported. But how do I use Cloudera Manager to deploy it? Is there a parcel available yet? Is there a new CSD file to download? I currently have CM 5.7.0 installed. Thanks, Ben > On Jun 10,…

Re: phoenix on non-apache hbase

2016-06-09 Thread Benjamin Kim
This interests me too. I asked Cloudera in their community forums a while back but got no answer on this. I hope they don’t leave us out in the cold. I also tried building it before, using the instructions here: https://issues.apache.org/jira/browse/PHOENIX-2834. I could get it to build, but I…

Github Integration

2016-06-09 Thread Benjamin Kim
I heard that Zeppelin 0.6.0 is able to use its local notebook directory as a Github repo. Does anyone know of a way (or workaround) to make it work with our company’s Github (Stash) repo server? Any advice would be welcome. Thanks, Ben

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
…org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > Compressed: No > Num Buckets:…

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
…http://talebzadehmich.wordpress.com > On 3 June 2016 at 17:04, Benjamin Kim <bbuil...@gmail.com> wrote: > The table already exists. > CREATE EXTERNAL TABLE `amo…

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
…http://talebzadehmich.wordpress.com > On 3 June 2016 at 14:13, Benjamin Kim <bbuil...@gmail.com> wrote: > Does anyone know how to save data in a DataFrame to…

Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
Does anyone know how to save data in a DataFrame to a table partitioned using an existing column reformatted into a derived column? val partitionedDf = df.withColumn("dt", concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", 12, 2), lit(":00")))
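One possible way to finish this off against Spark 1.6's DataFrameWriter, as a hedged sketch; the save mode, format, and table name are assumptions, not part of the original question:

    // Use the derived "dt" column as the partition column when writing;
    // partitionedDf is the DataFrame built with withColumn above.
    partitionedDf.write
      .mode("append")
      .partitionBy("dt")
      .format("parquet")
      .saveAsTable("events_partitioned")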

Re: Spark on Kudu

2016-05-28 Thread Benjamin Kim
…http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc > -Chris George > On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com> wrote: > Can someone…

Re: Performance Question

2016-05-28 Thread Benjamin Kim
…where support will be built in? Thanks, Ben > On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote: > On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Hi Mike, > First of…

Re: Performance Question

2016-05-27 Thread Benjamin Kim
…they have addressed any of those issues. > Mike > On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com> wrote: > I am just curious. How will Kudu compare with Aerospike (http://www.aerospike.com)?…

Performance Question

2016-05-27 Thread Benjamin Kim
I am just curious. How will Kudu compare with Aerospike (http://www.aerospike.com)? I went to a Spark Roadshow and found out about this piece of software. It appears to fit our use case perfectly since we are an ad-tech company trying to leverage our user profiles data. Plus, it already has a

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
…Ben > On May 21, 2016, at 4:18 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Maybe more than one version of jets3t-xx.jar was on the classpath. > FYI > On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim <bbuil...@gmail.com> wrote…

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
…could be wrong. Thanks, Ben > On May 21, 2016, at 4:18 AM, Ted Yu <yuzhih...@gmail.com> wrote: > Maybe more than one version of jets3t-xx.jar was on the classpath. > FYI > On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim <bbuil...@gmail.com>…

Spark Streaming S3 Error

2016-05-20 Thread Benjamin Kim
I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of Spark 1.6.0. It seems not to work; I keep getting this error: Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand stack Exception Details: Location:…

Re: Spark on Kudu

2016-05-18 Thread Benjamin Kim
…support these types of statements, but we may be able to implement similar functionality through the API. > -Chris > On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote: > It would be nice to adhere to the SQL:2003…

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Benjamin Kim
I have a curiosity question. These forever/unlimited DataFrames/DataSets will persist and be query-capable. I am still foggy about how this data will be stored. As far as I know, memory is finite. Will the data be spilled to disk and be retrievable if the query spans data not in memory? Is…

CDH 5.7.0

2016-05-16 Thread Benjamin Kim
Has anyone got Phoenix to work with CDH 5.7.0? I tried manually patching and building the project using https://issues.apache.org/jira/browse/PHOENIX-2834 as a guide. I followed the instructions to install the components detailed in the top

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
…Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io > On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Hi Ofir,…

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Hi Ofir, I just recently saw the webinar with Reynold Xin. He mentioned the Spark Session unification efforts, but I don’t remember the DataSet for Structured Streaming, aka Continuous Applications as he put it. He did mention streaming or unlimited DataFrames for Structured Streaming, so one…

Sparse Data

2016-05-12 Thread Benjamin Kim
Can Kudu handle the use case where sparse data is involved? In many of our processes, we deal with data that can have any number of columns and many previously unknown column names depending on what attributes are brought in at the time. Currently, we use HBase to handle this. Since Kudu is

Re: Help with getting Zeppelin running on CH 5.7

2016-05-11 Thread Benjamin Kim
It’s currently being addressed here: https://github.com/apache/incubator-zeppelin/pull/868 > On May 11, 2016, at 3:08 PM, Shankar Roy wrote: > Hi, > I am trying to get Zeppelin running on a pseudo node…

Re: Save DataFrame to HBase

2016-05-10 Thread Benjamin Kim
…in the hbase-spark module. > Cheers > On Apr 27, 2016, at 10:31 PM, Benjamin Kim <bbuil...@gmail.com> wrote: >> Hi Ted, >> Do you know when the release will be? I also see some documentation for usage of the hb…

Zeppelin 0.6 Build

2016-05-07 Thread Benjamin Kim
When trying to build the latest from Git, I get these errors: [ERROR] /home/zeppelin/incubator-zeppelin/spark/src/main/java/org/apache/zeppelin/spark/ZeppelinR.java:[25,25] package parquet.org.slf4j does not exist [ERROR]…

Completed Tasks in YARN will not release resources

2016-04-30 Thread Benjamin Kim
Has anyone encountered this problem with YARN? It all started after an attempt to upgrade from CDH 5.4.8 to CDH 5.5.2. I ran jobs overnight, and they never completed. But it did take down the YARN ResourceManager and multiple NodeManagers after 5 or 6 hours. There was one job that, out of 450…

AWS SDK Client

2016-04-28 Thread Benjamin Kim
Has anyone used the AWS SDK client libraries in Java to instantiate a client for Spark jobs? Up to a few days ago, the SQS client was not having any problems, but all of a sudden, this error came up: java.lang.NoSuchMethodError:…
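For reference, a minimal sketch of instantiating an SQS client with the era's AWS SDK v1; the queue name is a placeholder, and credentials are assumed to come from the default provider chain:

    import com.amazonaws.services.sqs.AmazonSQSClient

    // The no-arg client picks up credentials from the default provider chain
    // (environment variables, system properties, instance profile, ...).
    val sqs = new AmazonSQSClient()
    val queueUrl = sqs.getQueueUrl("my-queue").getQueueUrl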

Re: Spark 2.0 Release Date

2016-04-28 Thread Benjamin Kim
Next Thursday is Databricks' webinar on Spark 2.0. If you are attending, I bet many are going to ask when the release will be. Last time they did this, Spark 1.6 came out not too long afterward. > On Apr 28, 2016, at 5:21 AM, Sean Owen wrote: > I don't know if anyone has…

Spark 2.0+ Structured Streaming

2016-04-28 Thread Benjamin Kim
Can someone explain to me how the new Structured Streaming works in the upcoming Spark 2.0+? I’m a little hazy on how data will be stored and referenced if it can be queried and/or batch-processed directly from streams, and whether the data will be append-only or whether there will be some sort of upsert…

Re: Save DataFrame to HBase

2016-04-27 Thread Benjamin Kim
…? Thanks, Ben > On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com> wrote: > The hbase-spark module in Apache HBase (coming with the HBase 2.0 release) can do this. > On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com>…

Re: Save DataFrame to HBase

2016-04-27 Thread Benjamin Kim
…<daniel.ha...@veracity-group.com> wrote: > Hi Benjamin, > Yes, it should work. > Let me know if you need further assistance; I might be able to get the code I've used for that project. > Thank you. > Daniel > On 24 Apr 2016, at 17:…

Convert DataFrame to Array of Arrays

2016-04-24 Thread Benjamin Kim
I have data in a DataFrame loaded from a CSV file. I need to load this data into HBase using an RDD formatted in a certain way. val rdd = sc.parallelize( Array(key1, (ColumnFamily, ColumnName1, Value1), (ColumnFamily, ColumnName2, Value2),…
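A hedged sketch of reshaping each DataFrame Row into that (key, Array((family, column, value), ...)) layout; the column and family names are placeholders:

    // Map each Row to a rowkey plus an array of (family, qualifier, value) cells.
    val rdd = df.rdd.map { row =>
      (row.getAs[String]("key"),
       Array(("cf", "col1", row.getAs[String]("col1")),
             ("cf", "col2", row.getAs[String]("col2"))))
    }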

Re: Save DataFrame to HBase

2016-04-24 Thread Benjamin Kim
…> I tried saving the DF to HBase using a Hive table with the HBase storage handler and hiveContext, but it failed due to a bug. > I was able to persist the DF to HBase using Apache Phoenix, which was pretty simple. > Thank you. > Daniel > On 21 Apr 2016, at 16:52, B…
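A hedged sketch of the phoenix-spark save path Daniel describes; the table name and zkUrl value below are placeholders:

    import org.apache.spark.sql.SaveMode

    // phoenix-spark registers the "org.apache.phoenix.spark" data source.
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .options(Map("table" -> "OUTPUT_TABLE", "zkUrl" -> "zkhost:2181"))
      .save()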

Re: Save DataFrame to HBase

2016-04-21 Thread Benjamin Kim
…(coming with the HBase 2.0 release) can do this. > On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com> wrote: > Has anyone found an easy way to save a DataFrame into HBase? > Thanks, > Ben

Save DataFrame to HBase

2016-04-21 Thread Benjamin Kim
Has anyone found an easy way to save a DataFrame into HBase? Thanks, Ben

HBase Spark Module

2016-04-20 Thread Benjamin Kim
I see that the new CDH 5.7 has been released with the HBase Spark module built-in. I was wondering if I could just download it and use the hbase-spark jar file for CDH 5.5. Has anyone tried this yet? Thanks, Ben

Re: JSON Usage

2016-04-17 Thread Benjamin Kim
>> You could certainly use RDDs for that; you might also find it easier to use a Dataset, selecting the fields you need to construct the URL to fetch, and then using the map function. >> On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim <bbuil...@gmail.com>…

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
t; Hi, > > Would you try this codes below? > > val csvRDD = ...your processimg for csv rdd.. > val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true) > > Thanks! > > On 16 Apr 2016 1:35 a.m., "Benjamin Kim" <bbuil...@gmail.com > <ma

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
…> Would you try the code below? > val csvRDD = ...your processing for csv rdd.. > val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true) > Thanks! > On 16 Apr 2016 1:35 a.m., "Benjamin Kim" <bbuil...@gmail.com>…

Re: JSON Usage

2016-04-15 Thread Benjamin Kim
…Karau <hol...@pigscanfly.ca> wrote: > You could certainly use RDDs for that; you might also find it easier to use a Dataset, selecting the fields you need to construct the URL to fetch, and then using the map function. > On Thu, Apr 14, 2016 at 12:01 PM, Be…

Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
…Please check the csvRdd API here: https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150 > Thanks!…

JSON Usage

2016-04-14 Thread Benjamin Kim
I was wondering what would be the best way to use JSON in Spark/Scala. I need to look up the values of fields in a collection of records to form a URL and download the file at that location. I was thinking an RDD would be perfect for this. I just want to hear from others who might have more experience…
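A hedged sketch of the Dataset/map route Holden suggests in the replies listed above; the input path, field names, and URL scheme are all placeholders:

    import scala.io.Source

    // Select the fields that form the URL, build it, then fetch each file.
    val urls = sqlContext.read.json("records.json")
      .select("host", "path")
      .map(r => s"http://${r.getString(0)}/${r.getString(1)}")
    val payloads = urls.map(url => (url, Source.fromURL(url).mkString))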

Data Export

2016-04-14 Thread Benjamin Kim
Does anyone know when the exporting of data into CSV, TSV, etc. files will be available? Thanks, Ben

Re: Spark on Kudu

2016-04-13 Thread Benjamin Kim
…implement similar functionality through the API. > -Chris > On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com> wrote: > It would be nice to adhere to the SQL:2003 standard for an “upsert” if it were to be…

Re: Spark on Kudu

2016-04-12 Thread Benjamin Kim
…that's as fully featured as Impala's? Do they care about being able to insert into Kudu with SparkSQL, or just being able to query really fast? Anything more specific to Spark that I'm missing? > FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our resource…

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-12 Thread Benjamin Kim
.load("s3://" + bucket + "/" + key) //save to hbase }) ssc.checkpoint(checkpointDirectory) // set checkpoint directory ssc } Thanks, Ben > On Apr 9, 2016, at 6:12 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > > Ah, I spoke too soon. > >

Re: Spark on Kudu

2016-04-10 Thread Benjamin Kim
…coming from… Cheers, Ben > On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote: > On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com> wrote: > J-D, > The main thing I hear that Cass…

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
…please let me know. Thanks, Ben > On Apr 9, 2016, at 2:49 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > This was easy! > I just created a notification on a source S3 bucket to kick off a Lambda function that would decompress the dropped file and save it to another…

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
…to be the endpoint of this notification. This would then convey to a listening Spark Streaming job the file information to download. I like this! Cheers, Ben > On Apr 9, 2016, at 9:54 AM, Benjamin Kim <bbuil...@gmail.com> wrote: > This is awesome! I have someplace to start from.…

Re: Spark Plugin Information

2016-04-09 Thread Benjamin Kim
…connect to Phoenix using JDBC, you should be able to take the JDBC url, pop off the 'jdbc:phoenix:' prefix, and use it as the 'zkUrl' option. > Josh > On Fri, Apr 8, 2016 at 6:47 PM, Benjamin Kim <bbuil...@gmail.com> wrote:…
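Josh's tip, written out as a hedged sketch; the JDBC url and table name are placeholders:

    // Strip the JDBC prefix and reuse the remainder as the zkUrl option.
    val jdbcUrl = "jdbc:phoenix:zk1,zk2,zk3:2181:/hbase"
    val zkUrl = jdbcUrl.stripPrefix("jdbc:phoenix:")
    val df = sqlContext.read
      .format("org.apache.phoenix.spark")
      .options(Map("table" -> "MY_TABLE", "zkUrl" -> zkUrl))
      .load()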

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
…> Sent from my iPhone > On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com> wrote: >> Nezih, >> This looks like a good alternative to having the Spark Streaming job check for new files…

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
…context.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", AWSSecretAccessKey) > val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder") > This code will probe for new S3 files created in your bucket every batch interval. > Thanks,…
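Natu's suggestion as a self-contained sketch; note that textFileStream only sees files that appear after the job starts, and the bucket, prefix, and batch interval here are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(60))
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Each batch picks up files newly created under the prefix.
    val inputS3Stream = ssc.textFileStream("s3n://example_bucket/folder")
    inputS3Stream.foreachRDD(rdd => println(s"new lines: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()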

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
…new S3 files created in your bucket every batch interval. > Thanks, > Natu > On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Has anyone monitored an S3 bucket or directory us…

Re: Spark Plugin Information

2016-04-08 Thread Benjamin Kim
…Though not supported by the Phoenix project at large, you may find this Docker image useful as a configuration reference: https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark > Good luck!…

Re: Zeppelin Dashboards

2016-04-08 Thread Benjamin Kim
Ashish, Quick question… Does this include charts updating in near real-time? Just wondering. Cheers, Ben > On Apr 6, 2016, at 12:21 PM, moon soo Lee wrote: > Hi Ashish, > Would tweaking looknfeel…

Re: Zeppelin UX Design Roadmap Proposal

2016-04-08 Thread Benjamin Kim
Hi Jeremy, I was wondering if the code bar could be placed on the side too? Could it appear and disappear, with a quick view of the contents upon rollover? Otherwise, what you put together is great! Cheers, Ben > On Apr 7, 2016, at 1:44 PM, Jeremy Anderson wrote: > Hi…

Spark Plugin Information

2016-04-08 Thread Benjamin Kim
I want to know if there is an update/patch coming to Spark or the Spark plugin. I see that the Spark plugin does not work because HBase classes are missing from the Spark assembly jar. So, when Spark does reflection, it does not look for HBase client classes in the Phoenix plugin jar but only in…

Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Benjamin Kim
Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled any new files to process? If so, can you provide basic Scala coding help on this? Thanks, Ben

can spark-csv package accept strings instead of files?

2016-04-01 Thread Benjamin Kim
Does anyone know if this is possible? I have an RDD loaded with rows of CSV data strings, each string representing the header row and multiple rows of data along with delimiters. I would like to feed each through a CSV parser to convert the data into a DataFrame and, ultimately, UPSERT a…

Re: Data Export

2016-04-01 Thread Benjamin Kim
…seems to be the most recent PR for that. > On Wed, Mar 16, 2016 at 1:26 PM, Benjamin Kim <bbuil...@gmail.com> wrote: > Any updates as to the progress of this issue? >> On Feb 26, 2016, at 6:16 PM, Khalid Huseynov…

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
…'yyyy-MM-dd')) AS TransactionDate, TransactionType, Description, Value, Balance, AccountName, AccountNumber FROM tmp """ sql(sqltext) println("\nFinished at")…

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
…> HTH > Dr Mich Talebzadeh > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > http://talebzadehmich.wordpress.com…

Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
I have a quick question. I have downloaded multiple zipped files from S3 and unzipped each one of them into strings. The next step is to parse them using a CSV parser. I want to know if there is a way to easily use the spark-csv package for this. Thanks, Ben

Re: Importaing Hbase data

2016-03-25 Thread Benjamin Kim
The hbase-spark module is still a work in progress in terms of Spark SQL. All the RDD methods are complete and ready to use against the current version of HBase 1.0+, but the use of DataFrames will require the unreleased version of HBase 2.0. Fortunately, there is work in progress to back-port
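A hedged sketch of the RDD-level API mentioned above (hbase-spark's HBaseContext); the table, column family, and qualifier names are placeholders:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.spark.HBaseContext
    import org.apache.hadoop.hbase.util.Bytes

    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())
    val rdd = sc.parallelize(Seq(("rowkey1", "v1"), ("rowkey2", "v2")))

    // Batch the RDD into HBase as puts against table "t".
    hbaseContext.bulkPut[(String, String)](rdd, TableName.valueOf("t"),
      { case (key, value) =>
          new Put(Bytes.toBytes(key))
            .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c"), Bytes.toBytes(value))
      })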

BinaryFiles to ZipInputStream

2016-03-23 Thread Benjamin Kim
I need a little help. I am loading zipped CSV files stored in S3 into Spark 1.6. First of all, I am able to get the list of file keys that have a modified date within a time range by using the AWS SDK objects (AmazonS3Client, ObjectListing, S3ObjectSummary, ListObjectsRequest,…
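For the unzipping step, a hedged sketch using binaryFiles and ZipInputStream; the path is a placeholder, and each archive is assumed to hold a single CSV entry:

    import java.util.zip.ZipInputStream
    import scala.io.Source

    // binaryFiles yields (path, PortableDataStream) pairs.
    val zipped = sc.binaryFiles("s3n://example_bucket/incoming/*.zip")
    val lines = zipped.flatMap { case (path, stream) =>
      val zis = new ZipInputStream(stream.open())
      // Read the first (and assumed only) entry as lines of text.
      Option(zis.getNextEntry).toSeq
        .flatMap(_ => Source.fromInputStream(zis).getLines().toList)
    }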

Re: new object store driver for Spark

2016-03-22 Thread Benjamin Kim
Hi Gil, Currently, our company uses S3 heavily for data storage. Can you further explain the benefits of this in relation to S3 when the pending patch does come out? Also, I have heard of Swift from others. Can you explain to me the pros and cons of Swift compared to HDFS? It can be just a
