Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim
Moon,

My environment consists of an 18-node CentOS 6.7 cluster with 24 cores, 64GB
memory, and 12TB of storage each:
- 3 of those nodes are used as ZooKeeper servers, HDFS NameNodes, and a YARN
  ResourceManager
- 15 are DataNodes
- jdk1.8_60 and CDH 5.7.1 are installed

Another node is an app server with 24 cores, 128GB memory, and 1TB storage. It
has Zeppelin 0.6.0 and Livy 0.2.0 running on it, plus the Hive Metastore,
HiveServer2, Hue, and Oozie from CDH 5.7.1.

This is our QA cluster where we are testing before deploying to production.

If you need more information, please let me know.

Thanks,
Ben

 

> On Jul 7, 2016, at 7:54 PM, moon soo Lee <m...@apache.org> wrote:
> 
> Randy,
> 
> Helium is not included in the 0.6.0 release. Could you check which version
> you are using?
> I created a fix for 500 errors from Helium URL in master branch. 
> https://github.com/apache/zeppelin/pull/1150 
> <https://github.com/apache/zeppelin/pull/1150>
> 
> Ben,
> I can not reproduce the error, could you share how to reproduce error, or 
> share your environment?
> 
> Thanks,
> moon
> 
> On Thu, Jul 7, 2016 at 4:02 PM Randy Gelhausen <rgel...@gmail.com 
> <mailto:rgel...@gmail.com>> wrote:
> I don't. I hoped providing that information might help with finding and
> fixing the problem.
> 
> On Thu, Jul 7, 2016 at 5:53 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Randy,
> 
> Do you know of any way to fix it or know of a workaround?
> 
> Thanks,
> Ben
> 
>> On Jul 7, 2016, at 2:08 PM, Randy Gelhausen <rgel...@gmail.com 
>> <mailto:rgel...@gmail.com>> wrote:
>> 
>> HTTP 500 errors from a Helium URL
> 
> 



Re: [ANNOUNCE] Apache Zeppelin 0.6.0 released

2016-07-07 Thread Benjamin Kim
Hi Randy,

Do you know of any way to fix it or know of a workaround?

Thanks,
Ben

> On Jul 7, 2016, at 2:08 PM, Randy Gelhausen  wrote:
> 
> HTTP 500 errors from a Helium URL



Re: Shiro LDAP w/ Search Bind Authentication

2016-07-06 Thread Benjamin Kim
Rob,

I got it to work without having to use those settings. I guess Shiro gets 
around our LDAP authentication.

Thanks,
Ben


> On Jul 6, 2016, at 3:33 PM, Rob Anderson <rockclimbings...@gmail.com> wrote:
> 
> You can find some documentation on it here: 
> https://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/security/shiroauthentication.html
>  
> <https://zeppelin.apache.org/docs/0.7.0-SNAPSHOT/security/shiroauthentication.html>
> 
> I believe you'll need to be running the .6 release or .7 snapshot to use 
> shiro.
> 
> We're authing against AD via ldaps calls without issue.  We're then using 
> group memberships to define roles and control access to notebooks.
> 
> Hope that helps.
> 
> Rob
> 
> 
> On Wed, Jul 6, 2016 at 2:01 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I have been trying to find documentation on how to enable LDAP 
> authentication, but I cannot find how to enter the values for these 
> configurations. This is necessary because our LDAP server is secured. Here 
> are the properties that I need to set:
> ldap_cert
> use_start_tls
> bind_dn
> bind_password
> 
> Can someone help?
> 
> Thanks,
> Ben
> 
> 



Shiro LDAP w/ Search Bind Authentication

2016-07-06 Thread Benjamin Kim
I have been trying to find documentation on how to enable LDAP authentication, 
but I cannot find how to enter the values for these configurations. This is 
necessary because our LDAP server is secured. Here are the properties that I 
need to set:
ldap_cert
use_start_tls
bind_dn
bind_password

Can someone help?

Thanks,
Ben



Re: SnappyData and Structured Streaming

2016-07-06 Thread Benjamin Kim
Jags,

Thanks for the details. This makes things much clearer. I saw in the Spark
roadmap that version 2.1 will add the SQL capabilities mentioned here. It looks
like the Spark community is gradually coming to the same conclusions about
streaming that the SnappyData folks reached a while back. But there is always
the need for a better way to store the data underlying Spark. The State Store
information was informative too; I can envision it using this data store as
well if need be.

Thanks again,
Ben

> On Jul 6, 2016, at 8:52 AM, Jags Ramnarayan <jramnara...@snappydata.io> wrote:
> 
> The plan is to fully integrate with the new structured streaming API and 
> implementation in an upcoming release. But, we will continue offering several 
> extensions. Few noted below ...
> 
> - the store (streaming sink) will offer a lot more capabilities like 
> transactions, replicated tables, partitioned row and column oriented tables 
> to suit different types of workloads. 
> - While streaming API(scala) in snappydata itself will change a bit to become 
> fully compatible with structured streaming(SchemaDStream will go away), we 
> will continue to offer SQL support for streams so they can be managed from 
> external clients (JDBC, ODBC), their partitions can share the same 
> partitioning strategy as the underlying table where it might be stored, and 
> even registrations of continuous queries from remote clients. 
> 
> While building streaming apps using the Spark API offers tremendous
> flexibility, we also want to make it simple for apps to work with streams just
> using SQL. For instance, you should be able to declaratively specify a table 
> as a sink to a stream(i.e. using SQL). For example, you can specify a "TopK 
> Table" (a built in special table for topK analytics using probabilistic data 
> structures) as a sink for a high velocity time series stream like this - 
> "create topK table MostPopularTweets on tweetStreamTable " +
> "options(key 'hashtag', frequencyCol 'retweets', timeSeriesColumn 
> 'tweetTime' )" 
> where 'tweetStreamTable' is created using the 'create stream table ...' SQL 
> syntax. 
> 
> 
> -
> Jags
> SnappyData blog <http://www.snappydata.io/blog>
> Download binary, source <https://github.com/SnappyDataInc/snappydata>
> 
> 
> On Wed, Jul 6, 2016 at 8:02 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Jags,
> 
> I should have been more specific. I am referring to what I read at 
> http://snappydatainc.github.io/snappydata/streamingWithSQL/ 
> <http://snappydatainc.github.io/snappydata/streamingWithSQL/>, especially the 
> Streaming Tables part. It roughly coincides with the Streaming DataFrames 
> outlined here 
> https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.ff0opfdo6q1h
>  
> <https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.ff0opfdo6q1h>.
>  I don’t know if I’m wrong, but they both sound very similar. That’s why I
> posed this question.
> 
> Thanks,
> Ben
> 
>> On Jul 6, 2016, at 7:03 AM, Jags Ramnarayan <jramnara...@snappydata.io 
>> <mailto:jramnara...@snappydata.io>> wrote:
>> 
>> Ben,
>>Note that Snappydata's primary objective is to be a distributed in-memory 
>> DB for mixed workloads (i.e. streaming with transactions and analytic 
>> queries). On the other hand, Spark, till date, is primarily designed as a 
>> processing engine over myriad storage engines (SnappyData being one). So, 
>> the marriage is quite complementary. The difference compared to other stores 
>> is that SnappyData realizes its solution by deeply integrating and 
>> collocating with Spark (i.e. share spark executor memory/resources with the 
>> store) avoiding serializations and shuffle in many situations.
>> 
>> On your specific thought about being similar to Structured streaming, a 
>> better discussion could be a comparison to the recently introduced State 
>> store 
>> <https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#heading=h.2h7zw4ru3nw7>
>>  (perhaps this is what you meant). 
>> It proposes a KV store for streaming aggregations with support for updates. 
>> The proposed API will, at some point, be pluggable so vendors can easily 
>> support alternate implementations to storage, not just HDFS(default store in 
>> proposed State store). 
>> 
>> 
>> -
>> Jags
>> SnappyData blog <http://www.snappydata.io/blog>
>> Download binary, source <https://github.com/SnappyDataInc/snappydata>

Re: SnappyData and Structured Streaming

2016-07-06 Thread Benjamin Kim
Jags,

I should have been more specific. I am referring to what I read at 
http://snappydatainc.github.io/snappydata/streamingWithSQL/, especially the 
Streaming Tables part. It roughly coincides with the Streaming DataFrames 
outlined here 
https://docs.google.com/document/d/1NHKdRSNCbCmJbinLmZuqNA1Pt6CGpFnLVRbzuDUcZVM/edit#heading=h.ff0opfdo6q1h.
 I don’t know if I’m wrong, but they both sound very similar. That’s why I
posed this question.

Thanks,
Ben

> On Jul 6, 2016, at 7:03 AM, Jags Ramnarayan <jramnara...@snappydata.io> wrote:
> 
> Ben,
>Note that Snappydata's primary objective is to be a distributed in-memory 
> DB for mixed workloads (i.e. streaming with transactions and analytic 
> queries). On the other hand, Spark, till date, is primarily designed as a 
> processing engine over myriad storage engines (SnappyData being one). So, the 
> marriage is quite complementary. The difference compared to other stores is 
> that SnappyData realizes its solution by deeply integrating and collocating 
> with Spark (i.e. share spark executor memory/resources with the store) 
> avoiding serializations and shuffle in many situations.
> 
> On your specific thought about being similar to Structured streaming, a 
> better discussion could be a comparison to the recently introduced State 
> store 
> <https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#heading=h.2h7zw4ru3nw7>
>  (perhaps this is what you meant). 
> It proposes a KV store for streaming aggregations with support for updates. 
> The proposed API will, at some point, be pluggable so vendors can easily 
> support alternate implementations to storage, not just HDFS(default store in 
> proposed State store). 
> 
> 
> -
> Jags
> SnappyData blog <http://www.snappydata.io/blog>
> Download binary, source <https://github.com/SnappyDataInc/snappydata>
> 
> 
> On Wed, Jul 6, 2016 at 12:49 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I recently got a sales email from SnappyData, and after reading the 
> documentation about what they offer, it sounds very similar to what 
> Structured Streaming will offer w/o the underlying in-memory, spill-to-disk, 
> CRUD compliant data storage in SnappyData. I was wondering if Structured 
> Streaming is trying to achieve the same on its own or is SnappyData 
> contributing Streaming extensions that they built to the Spark project. 
> Lastly, what does the Spark community think of this so-called “Spark Data 
> Store”?
> 
> Thanks,
> Ben
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Re: Performance Question

2016-07-06 Thread Benjamin Kim
Over the weekend, the row count got up to just under 500M. I will give it
another few days to get to 1B rows. I still get consistent times of ~15s for
row counts despite the growing amount of data.

On another note, I got a solicitation email from SnappyData to evaluate their
product. They claim to be the “Spark Data Store” with tight integration with
Spark executors. It claims to be an OLTP and OLAP system that is an in-memory
data store first and spills to disk. After going to several Spark events, it
would seem that this is the new “hot” area for vendors. They all (MemSQL,
Redis, Aerospike, Datastax, etc.) claim to be the best “Spark Data Store”. I’m
wondering if Kudu will become this too. With the performance I’ve seen so far,
it would seem that it can be a contender. All that is needed is a hardened
Spark connector package, I would think. The next evaluation I will be
conducting is to see if SnappyData’s claims are valid by doing my own tests.

Cheers,
Ben


> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Using spark or something more 
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
> some queries
> 
> Todd
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed.
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
> only preliminary tests. Do you know of a way to really do some conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published or any 
>> instructions on how to measure them myself as a good way to test? In addition,
>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>> where support will be built in?
>> 
>> We don't have a lot of benchmarks published yet, especially on the write 
>> side. I've found that thorough cross-system benchmarks are very difficult to 
>> do fairly and accurately, and often times users end up misguided if they pay 
>> too much attention to them :) So, given a finite number of developers 
>> working on Kudu, I think we've tended to spend more time on the project 
>> itself and less time focusing on "competition". I'm sure there are use cases 
>> where Kudu will beat out Aerospike, and probably use cases where Aerospike 
>> will beat Kudu as well.
>> 
>> From my perspective, it would be great if you can share some details of your 
>> workload, especially if there are some areas you're finding Kudu lacking. 
>> Maybe we can spot some easy code changes we could make to improve 
>> performance, or suggest a tuning variable you could change.
>> 
>> -Todd
>> 
>> 
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com 
>>> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Mike,
>>> 
>>> First of all, thanks for the link. It looks like an interesting read. I 
>>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>>> they are evaluating version 3.5.4. The main thing that impressed me was 
>>> their claim that they can beat Cassandra and HBase by 8x for writing and 
>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M 
>>> records per second with only 50 nodes. I wanted to see if this is real.
>>> 
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>>> depending on the size of your records and the insertion order. I've been 
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>>> better.
>>>  
>>> 
>>> To answer your questions, we have a DMP with user profiles with many

SnappyData and Structured Streaming

2016-07-05 Thread Benjamin Kim
I recently got a sales email from SnappyData, and after reading the 
documentation about what they offer, it sounds very similar to what Structured 
Streaming will offer w/o the underlying in-memory, spill-to-disk, CRUD 
compliant data storage in SnappyData. I was wondering if Structured Streaming 
is trying to achieve the same on its own or is SnappyData contributing 
Streaming extensions that they built to the Spark project. Lastly, what does 
the Spark community think of this so-called “Spark Data Store”?

Thanks,
Ben
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark interpreter

2016-07-01 Thread Benjamin Kim
Moon,

I have downloaded and tested the bin-all tarball, and it has some deficiencies
compared to the build-from-source version:
- CSV, TSV download is missing
- Doesn’t work with HBase 1.2 in CDH 5.7.0
- Spark still does not work with Spark 1.6.0 in CDH 5.7.0 (JDK8); using Livy is
  a good workaround
- Doesn’t work with Phoenix 4.7 in CDH 5.7.0

Everything else looks good, especially in the area of multi-tenancy and
security. I would like to know how to use the Credentials feature to secure
usernames and passwords. I couldn’t find documentation on how to do so.

Thanks,
Ben

> On Jul 1, 2016, at 9:04 AM, moon soo Lee <m...@apache.org> wrote:
> 
> 0.6.0 is currently in vote in dev@ list.
> http://apache-zeppelin-dev-mailing-list.75694.x6.nabble.com/VOTE-Apache-Zeppelin-release-0-6-0-rc1-tp11505.html
>  
> <http://apache-zeppelin-dev-mailing-list.75694.x6.nabble.com/VOTE-Apache-Zeppelin-release-0-6-0-rc1-tp11505.html>
> 
> Thanks,
> moon
> 
> On Thu, Jun 30, 2016 at 1:54 PM Leon Katsnelson <l...@ca.ibm.com 
> <mailto:l...@ca.ibm.com>> wrote:
> What is the expected day for v0.6?
> 
> 
> 
> 
> From:moon soo Lee <leemoon...@gmail.com <mailto:leemoon...@gmail.com>>
> To:users@zeppelin.apache.org <mailto:users@zeppelin.apache.org>
> Date:2016/06/30 11:36 AM
> Subject:Re: spark interpreter
> 
> 
> 
> Hi Ben,
> 
> Livy interpreter is included in 0.6.0. If it is not listed when you create 
> interpreter setting, could you check if your 'zeppelin.interpreters' property 
> list Livy interpreter classes? (conf/zeppelin-site.xml)
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> On a side note…
> 
> Has anyone got the Livy interpreter to be added as an interpreter in the 
> latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. 
> Could this interfere?
> 
> Thanks,
> Ben
> 
> 
> On Jun 29, 2016, at 11:18 AM, moon soo Lee <m...@apache.org 
> <mailto:m...@apache.org>> wrote:
> 
> The Livy interpreter internally creates multiple sessions for each user,
> independently of the 3 binding modes supported in Zeppelin.
> Therefore, in 'shared' mode the Livy interpreter will create a session per
> user, while 'scoped' or 'isolated' mode will create sessions per notebook,
> per user.
> 
> When a notebook is shared among users, they always use the same interpreter
> instance/process, for now. I think supporting per-user interpreter
> instances/processes would be future work.
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 7:57 AM Chen Song <chen.song...@gmail.com 
> <mailto:chen.song...@gmail.com>> wrote:
> Thanks for your explanation, Moon.
> 
> Following up on this, I can see the difference in terms of single or multiple 
> interpreter processes. 
> 
> With respect to spark drivers, since each interpreter spawns a separate Spark 
> driver in regular Spark interpreter setting, it is clear to me the different 
> implications of the 3 binding modes.
> 
> However, when it comes to Livy server with impersonation turned on, I am a 
> bit confused. Will Livy interpreter always create a new Spark driver (along 
> with a Spark Context instance) for each user session, regardless of the 
> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
> what I could tell, I see no difference between different binding modes for 
> Livy on as far as how Spark drivers are concerned.
> 
> Last question, when a notebook is shared among users, will they always use 
> the same interpreter instance/process already created?
> 
> Thanks
> Chen
> 
> 
> 
> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee <m...@apache.org 
> <mailto:m...@apache.org>> wrote:
> Hi,
> 
> Thanks for asking question. It's not dumb question at all, Zeppelin docs does 
> not explain very well.
> 
> Spark Interpreter, 
> 
> In 'shared' mode, a Spark interpreter setting spawns one interpreter process
> to serve all notebooks bound to that interpreter setting.
> In 'scoped' mode, a Spark interpreter setting spawns multiple interpreter
> processes, one per notebook bound to that interpreter setting.
> 
> Using the Livy interpreter,
> 
> Zeppelin propagates the current user information to the Livy interpreter, and
> the Livy interpreter creates a different session per user via the Livy Server.
> 
> 
> Hope this helps.
> 
> Thanks,
> moon
> 
> 
> On Tue, Jun 21, 2016 at 6:41 PM Chen Song <chen.song...@gmail.com 
> <mailto:chen.song...@gmail.com>> wrote:
> Zeppelin provides 3 binding modes for each interpreter. With a `scoped` or
> `shared` Spark interpreter, every user shares the same SparkContext. Sorry for
> the dumb question, how does it differ from Spark via Livy Server?
> 
> 
> -- 
> Chen Song
> 
> 
> 
> 



Re: Performance Question

2016-06-30 Thread Benjamin Kim
Hi Todd,

I changed the key to be what you suggested, and I can’t tell the difference 
since it was already fast. But, I did get more numbers.

> 104M rows in Kudu table
- read: 8s
- count: 16s
- aggregate: 9s

The time to read took much longer, going from 0.2s to 8s; counts were the same
at 16s; and aggregate queries took longer, going from 6s to 9s.
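
For context, here is a rough sketch of the kind of operations being timed. The
master address, table name, and column names are placeholders, and the
"org.kududb.spark.kudu" format string assumes the kudu-spark module of that
era; adjust to your own setup.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.countDistinct

val sc = new SparkContext(new SparkConf().setAppName("kudu-timings"))
val sqlContext = new SQLContext(sc)

// "read": load the Kudu table as a DataFrame through the kudu-spark connector.
val events = sqlContext.read
  .format("org.kududb.spark.kudu")
  .option("kudu.master", "kudu-master:7051")   // placeholder master address
  .option("kudu.table", "events")              // placeholder table name
  .load()

// "count": full row count.
val rows = events.count()

// "aggregate": count distinct event_id's per user_id, as in the earlier test.
events.groupBy("user_id").agg(countDistinct("event_id")).show()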

I’m still impressed.

Cheers,
Ben 

> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Using spark or something more 
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
> some queries
> 
> Todd
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed.
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
> only preliminary tests. Do you know of a way to really do some conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published or any 
>> instructions on how to measure them myself as a good way to test? In addition,
>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>> where support will be built in?
>> 
>> We don't have a lot of benchmarks published yet, especially on the write 
>> side. I've found that thorough cross-system benchmarks are very difficult to 
>> do fairly and accurately, and often times users end up misguided if they pay 
>> too much attention to them :) So, given a finite number of developers 
>> working on Kudu, I think we've tended to spend more time on the project 
>> itself and less time focusing on "competition". I'm sure there are use cases 
>> where Kudu will beat out Aerospike, and probably use cases where Aerospike 
>> will beat Kudu as well.
>> 
>> From my perspective, it would be great if you can share some details of your 
>> workload, especially if there are some areas you're finding Kudu lacking. 
>> Maybe we can spot some easy code changes we could make to improve 
>> performance, or suggest a tuning variable you could change.
>> 
>> -Todd
>> 
>> 
>>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com 
>>> <mailto:t...@cloudera.com>> wrote:
>>> 
>>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi Mike,
>>> 
>>> First of all, thanks for the link. It looks like an interesting read. I 
>>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>>> they are evaluating version 3.5.4. The main thing that impressed me was 
>>> their claim that they can beat Cassandra and HBase by 8x for writing and 
>>> 25x for reading. Their big claim to fame is that Aerospike can write 1M 
>>> records per second with only 50 nodes. I wanted to see if this is real.
>>> 
>>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>>> depending on the size of your records and the insertion order. I've been 
>>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>>> better.
>>>  
>>> 
>>> To answer your questions, we have a DMP with user profiles with many 
>>> attributes. We create segmentation information off of these attributes to 
>>> classify them. Then, we can target advertising appropriately for our sales 
>>> department. Much of the data processing is for applying models on all or if 
>>> not most of every profile’s attributes to find similarities (nearest 
>>> neighbor/clustering) over a large number of rows when batch processing or a 
>>> small subset of rows for quick online scoring. So, our use case is a 
>>> typical advanced analytics scenario. We have tried HBase, but it 

Re: spark interpreter

2016-06-30 Thread Benjamin Kim
Moon,

That worked! Quite a few more configuration properties have been added, so I
added those too in both zeppelin-site.xml and zeppelin-env.sh. But now I’m
getting errors when starting a Spark context.

Thanks,
Ben

> On Jun 30, 2016, at 8:10 AM, moon soo Lee <leemoon...@gmail.com> wrote:
> 
> Hi Ben,
> 
> Livy interpreter is included in 0.6.0. If it is not listed when you create 
> interpreter setting, could you check if your 'zeppelin.interpreters' property 
> list Livy interpreter classes? (conf/zeppelin-site.xml)
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 11:52 AM Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> On a side note…
> 
> Has anyone got the Livy interpreter to be added as an interpreter in the 
> latest build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. 
> Could this interfere?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 29, 2016, at 11:18 AM, moon soo Lee <m...@apache.org 
>> <mailto:m...@apache.org>> wrote:
>> 
>> The Livy interpreter internally creates multiple sessions for each user,
>> independently of the 3 binding modes supported in Zeppelin.
>> Therefore, in 'shared' mode the Livy interpreter will create a session per
>> user, while 'scoped' or 'isolated' mode will create sessions per notebook,
>> per user.
>> 
>> When a notebook is shared among users, they always use the same interpreter
>> instance/process, for now. I think supporting per-user interpreter
>> instances/processes would be future work.
>> 
>> Thanks,
>> moon
>> 
>> On Wed, Jun 29, 2016 at 7:57 AM Chen Song <chen.song...@gmail.com 
>> <mailto:chen.song...@gmail.com>> wrote:
>> Thanks for your explanation, Moon.
>> 
>> Following up on this, I can see the difference in terms of single or 
>> multiple interpreter processes. 
>> 
>> With respect to spark drivers, since each interpreter spawns a separate 
>> Spark driver in regular Spark interpreter setting, it is clear to me the 
>> different implications of the 3 binding modes.
>> 
>> However, when it comes to Livy server with impersonation turned on, I am a 
>> bit confused. Will Livy interpreter always create a new Spark driver (along 
>> with a Spark Context instance) for each user session, regardless of the 
>> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
>> what I could tell, I see no difference between different binding modes for 
>> Livy on as far as how Spark drivers are concerned.
>> 
>> Last question, when a notebook is shared among users, will they always use 
>> the same interpreter instance/process already created?
>> 
>> Thanks
>> Chen
>> 
>> 
>> 
>> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee <m...@apache.org 
>> <mailto:m...@apache.org>> wrote:
>> Hi,
>> 
>> Thanks for asking question. It's not dumb question at all, Zeppelin docs 
>> does not explain very well.
>> 
>> Spark Interpreter, 
>> 
>> In 'shared' mode, a Spark interpreter setting spawns one interpreter process
>> to serve all notebooks bound to that interpreter setting.
>> In 'scoped' mode, a Spark interpreter setting spawns multiple interpreter
>> processes, one per notebook bound to that interpreter setting.
>> 
>> Using the Livy interpreter,
>> 
>> Zeppelin propagates the current user information to the Livy interpreter, and
>> the Livy interpreter creates a different session per user via the Livy Server.
>> 
>> 
>> Hope this helps.
>> 
>> Thanks,
>> moon
>> 
>> 
>> On Tue, Jun 21, 2016 at 6:41 PM Chen Song <chen.song...@gmail.com 
>> <mailto:chen.song...@gmail.com>> wrote:
>> Zeppelin provides 3 binding modes for each interpreter. With a `scoped` or
>> `shared` Spark interpreter, every user shares the same SparkContext. Sorry for
>> the dumb question, how does it differ from Spark via Livy Server?
>> 
>> 
>> -- 
>> Chen Song
>> 
> 



Re: Performance Question

2016-06-29 Thread Benjamin Kim
Todd,

FYI, the key is unique for every row, so rows are not going to already exist.
Basically, everything is an INSERT.

val generateUUID = udf(() => UUID.randomUUID().toString)

As you can see, we are using the Java UUID library to create the key.
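
For completeness, a minimal, self-contained version of that snippet; the
DataFrame and the "key" column name are hypothetical:

import java.util.UUID
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf

// UDF that generates a random UUID string for each row.
val generateUUID = udf(() => UUID.randomUUID().toString)

// Hypothetical usage: add a surrogate primary key column to an events
// DataFrame before writing it to Kudu, so every row gets a unique key.
def withKey(events: DataFrame): DataFrame =
  events.withColumn("key", generateUUID())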

Cheers,
Ben

> On Jun 29, 2016, at 1:32 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Wed, Jun 29, 2016 at 11:32 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I started Spark streaming more events into Kudu. Performance is great there 
> too! With HBase, it’s fast too, but I noticed that it pauses here and there, 
> making it take seconds for > 40k rows at a time, while Kudu doesn’t. The 
> progress bar just blinks by. I will keep this running until it hits 1B rows 
> and rerun my performance tests. This, hopefully, will give better numbers.
> 
> Cool! We have invested a lot of work in making Kudu have consistent 
> performance, like you mentioned. It's generally been my experience that most 
> mature ops people would prefer a system which consistently performs well 
> rather than one which has higher peak performance but occasionally stalls.
> 
> BTW, what is your row key design? One exception to the above is that, if 
> you're doing random inserts, you may see performance "fall off a cliff" once 
> the size of your key columns becomes larger than the aggregate memory size of 
> your cluster, if you're running on hard disks. Our inserts require checks for 
> duplicate keys, and that can cause random disk IOs if your keys don't fit 
> comfortably in cache. This is one area that HBase is fundamentally going to 
> be faster based on its design.
> 
> -Todd
> 
> 
>> On Jun 28, 2016, at 4:26 PM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Cool, thanks for the report, Ben. For what it's worth, I think there's still 
>> some low hanging fruit in the Spark connector for Kudu (for example, I 
>> believe locality on reads is currently broken). So, you can expect 
>> performance to continue to improve in future versions. I'd also be 
>> interested to see results on Kudu for a much larger dataset - my guess is a 
>> lot of the 6 seconds you're seeing is constant overhead from Spark job 
>> setup, etc, given that the performance doesn't seem to get slower as you 
>> went from 700K rows to 13M rows.
>> 
>> -Todd
>> 
>> On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> FYI.
>> 
>> I did a quick-n-dirty performance test.
>> 
>> First, the setup:
>> QA cluster:
>> 15 data nodes
>> 64GB memory each
>> HBase is using 4GB of memory
>> Kudu is using 1GB of memory
>> 1 HBase/Kudu master node
>> 64GB memory
>> HBase/Kudu master is using 1GB of memory each
>> 10Gb Ethernet
>> 
>> Using Spark on both to load/read events data (84 columns per row), I was 
>> able to record performance for each. On the HBase side, I used the Phoenix 
>> 4.7 Spark plugin where DataFrames can be used directly. On the Kudu side, I 
>> used the Spark connector. I created an events table in Phoenix using the 
>> CREATE TABLE statement and created the equivalent in Kudu using the Spark 
>> method based off of a DataFrame schema.
>> 
>> Here are the numbers for Phoenix/HBase.
>> 1st run:
>> > 715k rows
>> - write: 2.7m
>> 
>> > 715k rows in HBase table
>> - read: 0.1s
>> - count: 3.8s
>> - aggregate: 61s
>> 
>> 2nd run:
>> > 5.2M rows
>> - write: 11m
>> * had 4 region servers go down, had to retry the 5.2M row write
>> 
>> > 5.9M rows in HBase table
>> - read: 8s
>> - count: 3m
>> - aggregate: 46s
>> 
>> 3rd run:
>> > 6.8M rows
>> - write: 9.6m
>> 
>> > 12.7M rows
>> - read: 10s
>> - count: 3m
>> - aggregate: 44s
>> 
>> 
>> Here are the numbers for Kudu.
>> 1st run:
>> > 715k rows
>> - write: 18s
>> 
>> > 715k rows in Kudu table
>> - read: 0.2s
>> - count: 18s
>> - aggregate: 5s
>> 
>> 2nd run:
>> > 5.2M rows
>> - write: 33s
>> 
>> > 5.9M rows in Kudu table
>> - read: 0.2s
>> - count: 16s
>> - aggregate: 6s
>> 
>> 3rd run:
>> > 6.8M rows
>> - write: 27s
>> 
>> > 12.7M rows in Kudu table
>> - read: 0.2s
>> - count: 16s
>> - aggregate: 6s
>> 
>> The Kudu results are impressive if you take these number as-is. Kudu is 
>> clo

Re: spark interpreter

2016-06-29 Thread Benjamin Kim
On a side note…

Has anyone got the Livy interpreter to be added as an interpreter in the latest 
build of Zeppelin 0.6.0? By the way, I have Shiro authentication on. Could this 
interfere?

Thanks,
Ben


> On Jun 29, 2016, at 11:18 AM, moon soo Lee  wrote:
> 
> The Livy interpreter internally creates multiple sessions for each user,
> independently of the 3 binding modes supported in Zeppelin.
> Therefore, in 'shared' mode the Livy interpreter will create a session per
> user, while 'scoped' or 'isolated' mode will create sessions per notebook,
> per user.
> 
> When a notebook is shared among users, they always use the same interpreter
> instance/process, for now. I think supporting per-user interpreter
> instances/processes would be future work.
> 
> Thanks,
> moon
> 
> On Wed, Jun 29, 2016 at 7:57 AM Chen Song  > wrote:
> Thanks for your explanation, Moon.
> 
> Following up on this, I can see the difference in terms of single or multiple 
> interpreter processes. 
> 
> With respect to spark drivers, since each interpreter spawns a separate Spark 
> driver in regular Spark interpreter setting, it is clear to me the different 
> implications of the 3 binding modes.
> 
> However, when it comes to Livy server with impersonation turned on, I am a 
> bit confused. Will Livy interpreter always create a new Spark driver (along 
> with a Spark Context instance) for each user session, regardless of the 
> binding mode of Livy interpreter? I am not very familiar with Livy, but from 
> what I could tell, I see no difference between different binding modes for 
> Livy on as far as how Spark drivers are concerned.
> 
> Last question, when a notebook is shared among users, will they always use 
> the same interpreter instance/process already created?
> 
> Thanks
> Chen
> 
> 
> 
> On Fri, Jun 24, 2016 at 11:51 AM moon soo Lee  > wrote:
> Hi,
> 
> Thanks for asking question. It's not dumb question at all, Zeppelin docs does 
> not explain very well.
> 
> Spark Interpreter, 
> 
> In 'shared' mode, a Spark interpreter setting spawns one interpreter process
> to serve all notebooks bound to that interpreter setting.
> In 'scoped' mode, a Spark interpreter setting spawns multiple interpreter
> processes, one per notebook bound to that interpreter setting.
> 
> Using the Livy interpreter,
> 
> Zeppelin propagates the current user information to the Livy interpreter, and
> the Livy interpreter creates a different session per user via the Livy Server.
> 
> 
> Hope this helps.
> 
> Thanks,
> moon
> 
> 
> On Tue, Jun 21, 2016 at 6:41 PM Chen Song  > wrote:
> Zeppelin provides 3 binding modes for each interpreter. With a `scoped` or
> `shared` Spark interpreter, every user shares the same SparkContext. Sorry for
> the dumb question, how does it differ from Spark via Livy Server?
> 
> 
> -- 
> Chen Song
> 



Re: Performance Question

2016-06-29 Thread Benjamin Kim
Todd,

I started Spark streaming more events into Kudu. Performance is great there 
too! With HBase, it’s fast too, but I noticed that it pauses here and there, 
making it take seconds for > 40k rows at a time, while Kudu doesn’t. The 
progress bar just blinks by. I will keep this running until it hits 1B rows and 
rerun my performance tests. This, hopefully, will give better numbers.
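
For context, here is a rough sketch of what that streaming path can look like.
The KuduContext class, its single-argument constructor, and insertRows() are
assumptions based on the kudu-spark module of that era, and the socket source,
master address, and table name are placeholders; check the connector source for
the exact API.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.kududb.spark.kudu.KuduContext

val sc = new SparkContext(new SparkConf().setAppName("events-to-kudu"))
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val kuduContext = new KuduContext("kudu-master:7051")   // placeholder address

// Placeholder source of JSON event strings (Kafka in the real pipeline).
val events = ssc.socketTextStream("localhost", 9999)

events.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Convert each micro-batch to a DataFrame and insert it into Kudu.
    val batch = sqlContext.read.json(rdd)
    kuduContext.insertRows(batch, "events")             // placeholder table
  }
}

ssc.start()
ssc.awaitTermination()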

Thanks,
Ben


> On Jun 28, 2016, at 4:26 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Cool, thanks for the report, Ben. For what it's worth, I think there's still 
> some low hanging fruit in the Spark connector for Kudu (for example, I 
> believe locality on reads is currently broken). So, you can expect 
> performance to continue to improve in future versions. I'd also be interested 
> to see results on Kudu for a much larger dataset - my guess is a lot of the 6 
> seconds you're seeing is constant overhead from Spark job setup, etc, given 
> that the performance doesn't seem to get slower as you went from 700K rows to 
> 13M rows.
> 
> -Todd
> 
> On Tue, Jun 28, 2016 at 3:03 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> FYI.
> 
> I did a quick-n-dirty performance test.
> 
> First, the setup:
> QA cluster:
> 15 data nodes
> 64GB memory each
> HBase is using 4GB of memory
> Kudu is using 1GB of memory
> 1 HBase/Kudu master node
> 64GB memory
> HBase/Kudu master is using 1GB of memory each
> 10Gb Ethernet
> 
> Using Spark on both to load/read events data (84 columns per row), I was able 
> to record performance for each. On the HBase side, I used the Phoenix 4.7 
> Spark plugin where DataFrames can be used directly. On the Kudu side, I used 
> the Spark connector. I created an events table in Phoenix using the CREATE 
> TABLE statement and created the equivalent in Kudu using the Spark method 
> based off of a DataFrame schema.
> 
> Here are the numbers for Phoenix/HBase.
> 1st run:
> > 715k rows
> - write: 2.7m
> 
> > 715k rows in HBase table
> - read: 0.1s
> - count: 3.8s
> - aggregate: 61s
> 
> 2nd run:
> > 5.2M rows
> - write: 11m
> * had 4 region servers go down, had to retry the 5.2M row write
> 
> > 5.9M rows in HBase table
> - read: 8s
> - count: 3m
> - aggregate: 46s
> 
> 3rd run:
> > 6.8M rows
> - write: 9.6m
> 
> > 12.7M rows
> - read: 10s
> - count: 3m
> - aggregate: 44s
> 
> 
> Here are the numbers for Kudu.
> 1st run:
> > 715k rows
> - write: 18s
> 
> > 715k rows in Kudu table
> - read: 0.2s
> - count: 18s
> - aggregate: 5s
> 
> 2nd run:
> > 5.2M rows
> - write: 33s
> 
> > 5.9M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
> 
> 3rd run:
> > 6.8M rows
> - write: 27s
> 
> > 12.7M rows in Kudu table
> - read: 0.2s
> - count: 16s
> - aggregate: 6s
> 
> The Kudu results are impressive if you take these numbers as-is. Kudu is
> close to 18x faster at writing (UPSERT). Kudu is 30x faster at reading (HBase
> times increase as data size grows). Kudu is 7x faster at full row counts.
> Lastly, Kudu is 3x faster doing an aggregate query (count distinct event_id’s
> per user_id). *Remember that this is a small cluster, times are still
> respectable for both systems, HBase could have been configured better, and
> the HBase table could have been better tuned.
> 
> Cheers,
> Ben
> 
> 
>> On Jun 15, 2016, at 10:13 AM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> Adding partition splits when range partitioning is done via the 
>> CreateTableOptions.addSplitRow 
>> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
>>  method.  You can find more about the different partitioning options in the 
>> schema design guide 
>> <http://getkudu.io/docs/schema_design.html#data-distribution>.  We generally 
>> recommend sticking to hash partitioning if possible, since you don't have to 
>> determine your own split rows.
>> 
>> - Dan
>> 
>> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> I think the locality is not within our setup. We have the compute cluster 
>> with Spark, YARN, etc. on its own, and we have the storage cluster with 
>> HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute 
>> cluster and beefed up storage capacity on the storage cluster. We got this 
>> setup idea from the Databricks folks. I do have a question. I created the 
>

Kudu Connector

2016-06-29 Thread Benjamin Kim
I was wondering if anyone, who is a Spark Scala developer, would be willing to 
continue the work done for the Kudu connector?

https://github.com/apache/incubator-kudu/tree/master/java/kudu-spark/src/main/scala/org/kududb/spark/kudu

I have been testing and using Kudu for the past month and comparing it against
HBase. It seems like a promising data store to complement Spark. It fills a gap
in our company for a fast, updatable data store. We stream GBs of data in and
run analytical queries against it, which typically run in well below a minute.
According to the Kudu users group, all it needs is to add SQL (JDBC) friendly
features (CREATE TABLE, intuitive save modes (append = upsert and overwrite =
truncate + insert), DELETE, etc.) and to improve performance by implementing
locality.
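
As an illustration of the "intuitive save modes" being asked for here, a hedged
sketch of how a DataFrame write might look once append behaves as upsert. The
format string and option names follow the kudu-spark module linked above; the
append-as-upsert semantics are what is being proposed, not necessarily what the
connector did at the time.

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical write helper for an events table.
def writeEvents(df: DataFrame, kuduMaster: String): Unit =
  df.write
    .format("org.kududb.spark.kudu")
    .option("kudu.master", kuduMaster)
    .option("kudu.table", "events")
    .mode(SaveMode.Append)   // proposed semantics: append = upsert
    .save()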

For reference, here is the page on contributing.

http://kudu.apache.org/docs/contributing.html

I am hoping that for individuals in the Spark community it would be relatively 
easy.

Thanks!
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Performance Question

2016-06-28 Thread Benjamin Kim
FYI.

I did a quick-n-dirty performance test.

First, the setup:
QA cluster:
- 15 data nodes, 64GB memory each
  - HBase is using 4GB of memory
  - Kudu is using 1GB of memory
- 1 HBase/Kudu master node, 64GB memory
  - HBase/Kudu master is using 1GB of memory each
- 10Gb Ethernet

Using Spark on both to load/read events data (84 columns per row), I was able 
to record performance for each. On the HBase side, I used the Phoenix 4.7 Spark 
plugin where DataFrames can be used directly. On the Kudu side, I used the 
Spark connector. I created an events table in Phoenix using the CREATE TABLE 
statement and created the equivalent in Kudu using the Spark method based off 
of a DataFrame schema.
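
A rough sketch of the two write paths described above. The table names, the
ZooKeeper URL, the master address, and the key column are placeholders; the
Phoenix call mirrors the phoenix-spark usage shown later in this digest, and
the KuduContext createTable/insertRows calls are assumptions based on the
kudu-spark module of that era.

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.kududb.client.CreateTableOptions
import org.kududb.spark.kudu.KuduContext

// Phoenix/HBase path: save the events DataFrame through the Phoenix Spark
// plugin into the table created with CREATE TABLE.
def writeToPhoenix(events: DataFrame, zkUrl: String): Unit =
  events.save("org.apache.phoenix.spark", SaveMode.Overwrite,
    Map("table" -> "EVENTS", "zkUrl" -> zkUrl))

// Kudu path: create the table from the DataFrame schema, then write through
// the connector.
def writeToKudu(events: DataFrame, kuduMaster: String): Unit = {
  val kuduContext = new KuduContext(kuduMaster)
  kuduContext.createTable("events", events.schema, Seq("key"),
    new CreateTableOptions().setNumReplicas(3))
  kuduContext.insertRows(events, "events")
}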

Here are the numbers for Phoenix/HBase.
1st run:
> 715k rows
- write: 2.7m

> 715k rows in HBase table
- read: 0.1s
- count: 3.8s
- aggregate: 61s

2nd run:
> 5.2M rows
- write: 11m
* had 4 region servers go down, had to retry the 5.2M row write

> 5.9M rows in HBase table
- read: 8s
- count: 3m
- aggregate: 46s

3rd run:
> 6.8M rows
- write: 9.6m

> 12.7M rows
- read: 10s
- count: 3m
- aggregate: 44s


Here are the numbers for Kudu.
1st run:
> 715k rows
- write: 18s

> 715k rows in Kudu table
- read: 0.2s
- count: 18s
- aggregate: 5s

2nd run:
> 5.2M rows
- write: 33s

> 5.9M rows in Kudu table
- read: 0.2s
- count: 16s
- aggregate: 6s

3rd run:
> 6.8M rows
- write: 27s

> 12.7M rows in Kudu table
- read: 0.2s
- count: 16s
- aggregate: 6s

The Kudu results are impressive if you take these numbers as-is. Kudu is close
to 18x faster at writing (UPSERT). Kudu is 30x faster at reading (HBase times
increase as data size grows). Kudu is 7x faster at full row counts. Lastly,
Kudu is 3x faster doing an aggregate query (count distinct event_id’s per
user_id). *Remember that this is a small cluster, times are still respectable
for both systems, HBase could have been configured better, and the HBase table
could have been better tuned.

Cheers,
Ben


> On Jun 15, 2016, at 10:13 AM, Dan Burkert <d...@cloudera.com> wrote:
> 
> Adding partition splits when range partitioning is done via the 
> CreateTableOptions.addSplitRow 
> <http://getkudu.io/apidocs/org/kududb/client/CreateTableOptions.html#addSplitRow-org.kududb.client.PartialRow->
>  method.  You can find more about the different partitioning options in the 
> schema design guide 
> <http://getkudu.io/docs/schema_design.html#data-distribution>.  We generally 
> recommend sticking to hash partitioning if possible, since you don't have to 
> determine your own split rows.
> 
> - Dan
> 
> On Wed, Jun 15, 2016 at 9:17 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> I think the locality is not within our setup. We have the compute cluster 
> with Spark, YARN, etc. on its own, and we have the storage cluster with 
> HBase, Kudu, etc. on another. We beefed up the hardware specs on the compute 
> cluster and beefed up storage capacity on the storage cluster. We got this 
> setup idea from the Databricks folks. I do have a question. I created the 
> table to use range partition on columns. I see that if I use hash partition I 
> can set the number of splits, but how do I do that using range (50 nodes * 10 
> = 500 splits)?
> 
> Thanks,
> Ben
> 
> 
>> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> Awesome use case. One thing to keep in mind is that spark parallelism will 
>> be limited by the number of tablets. So, you might want to split into 10 or 
>> so buckets per node to get the best query throughput.
>> 
>> Usually if you run top on some machines while running the query you can see 
>> if it is fully utilizing the cores.
>> 
>> Another known issue right now is that spark locality isn't working properly 
>> on replicated tables so you will use a lot of network traffic. For a perf 
>> test you might want to try a table with replication count 1
>> 
>> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi Todd,
>> 
>> I did a simple test of our ad events. We stream using Spark Streaming 
>> directly into HBase, and the Data Analysts/Scientists do some 
>> insight/discovery work plus some reports generation. For the reports, we use 
>> SQL, and the more deeper stuff, we use Spark. In Spark, our main data 
>> currency store of choice is DataFrames.
>> 
>> The schema is around 83 columns wide where most are of the string data type.
>> 
>> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
>> "user_id", "mappable_id&q

Re: Spark 1.6 (CDH 5.7) and Phoenix 4.7 (CLABS)

2016-06-27 Thread Benjamin Kim
Hi Sean,

I figured out the problem. Putting these jars in the Spark classpath.txt file
located in the Spark conf directory allowed them to be loaded first. This
fixed it!

Thanks,
Ben


> On Jun 27, 2016, at 4:20 PM, Sean Busbey <bus...@apache.org> wrote:
> 
> Hi Ben!
> 
> For problems with the Cloudera Labs packaging of Apache Phoenix, you should 
> first seek help on the vendor-specific community forums, to ensure the issue 
> isn't specific to the vendor:
> 
> http://community.cloudera.com/t5/Cloudera-Labs/bd-p/ClouderaLabs
> 
> -busbey
> 
> On 2016-06-27 15:27 (-0500), Benjamin Kim <bbuil...@gmail.com> wrote: 
>> Anyone tried to save a DataFrame to a HBase table using Phoenix? I am able 
>> to load and read, but I can’t save.
>> 
>>>> spark-shell --jars 
>>>> /opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/lib/phoenix-spark-4.7.0-clabs-phoenix1.3.0.jar,/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-4.7.0-clabs-phoenix1.3.0-client.jar
>> 
>> import org.apache.spark.sql._
>> import org.apache.phoenix.spark._
>> 
>> val hbaseConnectionString = ""
>> 
>> // Save to OUTPUT_TABLE
>> df.save("org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> 
>> "OUTPUT_TABLE",
>>  "zkUrl" -> hbaseConnectionString))
>> 
>> java.lang.ClassNotFoundException: Class 
>> org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
>>  at 
>> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
>>  at 
>> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)
>> 
>> Thanks,
>> Ben



Spark 1.6 (CDH 5.7) and Phoenix 4.7 (CLABS)

2016-06-27 Thread Benjamin Kim
Anyone tried to save a DataFrame to a HBase table using Phoenix? I am able to 
load and read, but I can’t save.

>> spark-shell --jars 
>> /opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/lib/phoenix-spark-4.7.0-clabs-phoenix1.3.0.jar,/opt/cloudera/parcels/CLABS_PHOENIX/lib/phoenix/phoenix-4.7.0-clabs-phoenix1.3.0-client.jar

import org.apache.spark.sql._
import org.apache.phoenix.spark._

val hbaseConnectionString = ""

// Save to OUTPUT_TABLE
df.save("org.apache.phoenix.spark", SaveMode.Overwrite, Map("table" -> 
"OUTPUT_TABLE",
  "zkUrl" -> hbaseConnectionString))

java.lang.ClassNotFoundException: Class 
org.apache.phoenix.mapreduce.PhoenixOutputFormat not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2105)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2197)

Thanks,
Ben

livy interpreter not appearing

2016-06-26 Thread Benjamin Kim
Has anyone tried using the livy interpreter? I cannot add it. It just does not 
appear after clicking save.

Thanks,
Ben

Re: phoenix on non-apache hbase

2016-06-25 Thread Benjamin Kim
What a surprise! I see that the phoenix 4.7.0-1.clabs_phoenix1.3.0.p0.000
parcel has been released by Cloudera. Now the usage testing starts.

Cheers,
Ben

> On Jun 9, 2016, at 10:09 PM, Ankur Jain <aj...@quadanalytix.com> wrote:
> 
> I have updated my jira with updated instructions 
> https://issues.apache.org/jira/browse/PHOENIX-2834 
> <https://issues.apache.org/jira/browse/PHOENIX-2834>.
> 
> Please do let me know if you are able to build and use with CDH5.7
> 
> Thanks,
> Ankur Jain
> 
> From: Andrew Purtell <andrew.purt...@gmail.com 
> <mailto:andrew.purt...@gmail.com>>
> Reply-To: "user@phoenix.apache.org <mailto:user@phoenix.apache.org>" 
> <user@phoenix.apache.org <mailto:user@phoenix.apache.org>>
> Date: Friday, 10 June 2016 at 9:06 AM
> To: "user@phoenix.apache.org <mailto:user@phoenix.apache.org>" 
> <user@phoenix.apache.org <mailto:user@phoenix.apache.org>>
> Subject: Re: phoenix on non-apache hbase
> 
> Yes a stock client should work with a server modified for CDH assuming both 
> client and server versions are within the bounds specified by the backwards 
> compatibility policy (https://phoenix.apache.org/upgrading.html 
> <https://phoenix.apache.org/upgrading.html>)
> 
> "Phoenix maintains backward compatibility across at least two minor releases 
> to allow for no downtime through server-side rolling restarts upon upgrading."
> 
> 
> On Jun 9, 2016, at 8:09 PM, Koert Kuipers <ko...@tresata.com 
> <mailto:ko...@tresata.com>> wrote:
> 
>> is phoenix client also affect by this? or does phoenix server isolate the 
>> client? 
>> 
>> is it reasonable to expect a "stock" phoenix client to work against a custom 
>> phoenix server for cdh 5.x? (with of course the phoenix client and server 
>> having same phoenix version).
>> 
>> 
>> 
>> On Thu, Jun 9, 2016 at 10:55 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>>> Andrew,
>>> 
>>> Since we are still on CDH 5.5.2, can I just use your custom version? 
>>> Phoenix is one of the reasons that we are blocked from upgrading to CDH 
>>> 5.7.1. Thus, CDH 5.7.1 is only on our test cluster. One of our developers 
>>> wants to try out the Phoenix Spark plugin. Did you try it out in yours too? 
>>> Does it work if you did?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 9, 2016, at 7:47 PM, Andrew Purtell <andrew.purt...@gmail.com 
>>>> <mailto:andrew.purt...@gmail.com>> wrote:
>>>> 
>>>> >  is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache HBase 
>>>> > 1.2.0?
>>>> 
>>>> Yes
>>>> 
>>>> As is the Cloudera HBase in 5.6, 5.5, 5.4, ... quite different from Apache 
>>>> HBase in coprocessor and RPC internal extension APIs. 
>>>> 
>>>> We have made some ports of Apache Phoenix releases to CDH here: 
>>>> https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5
>>>>  
>>>> <https://github.com/chiastic-security/phoenix-for-cloudera/tree/4.7-HBase-1.0-cdh5.5?files=1>
>>>>  
>>>> 
>>>> It's a personal project of mine, not something supported by the community. 
>>>> Sounds like I should look at what to do with CDH 5.6 and 5.7. 
>>>> 
>>>> On Jun 9, 2016, at 7:37 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>>> This interests me too. I asked Cloudera in their community forums a while 
>>>>> back but got no answer on this. I hope they don’t leave us out in the 
>>>>> cold. I tried building it too before with the instructions here 
>>>>> https://issues.apache.org/jira/browse/PHOENIX-2834 
>>>>> <https://issues.apache.org/jira/browse/PHOENIX-2834>. I could get it to 
>>>>> build, but I couldn’t get it to work using the Phoenix installation 
>>>>> instructions. For some reason, dropping the server jar into CDH 5.7.0 
>>>>> HBase lib directory didn’t change things. HBase seemed not to use it. Now 
>>>>> that this is out, I’ll give it another try hoping that there is a way. If 
>>>>> anyone has any leads to help, please let me know.
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Jun 9, 2016, at 6:39 PM, Josh Elser 

Model Quality Tracking

2016-06-24 Thread Benjamin Kim
Has anyone implemented a way to track the performance of a data model? We
currently have an algorithm that does record linkage and spits out statistics
of matches, non-matches, and/or partial matches, with reason codes for why we
didn’t match accurately. In this way, we will know if something goes wrong down
the line. All of this goes into CSV files in directories partitioned by
datetime, with a Hive table on top. Then we can do analytical queries and even
charting if need be. All of this is very manual, but I was wondering if there
is a package, software, built-in module, etc. that would do this automatically.
Since we are using CDH, it would be great if these graphs could be integrated
into Cloudera Manager too.
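
Absent a ready-made package, here is a hedged sketch of the manual pipeline
described above; the column names, output path, and use of the spark-csv
package are assumptions.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{count, lit}

// Aggregate record-linkage outcomes into per-run statistics: one row per
// (match_status, reason_code) combination with its record count.
def linkageStats(results: DataFrame): DataFrame =
  results.groupBy("match_status", "reason_code").agg(count(lit(1)).as("records"))

// Append the statistics to a datetime-partitioned directory that a Hive
// external table (partitioned by dt) sits on top of.
def writeStats(stats: DataFrame, runDate: String): Unit =
  stats.write
    .format("com.databricks.spark.csv")   // spark-csv package for Spark 1.x
    .option("header", "true")
    .save(s"/metrics/model_quality/dt=$runDate")

A new dt= partition then just needs to be added to the Hive table (for example
with ALTER TABLE ... ADD PARTITION) before the analytical queries and charts
pick it up.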

Any advice is welcome.

Thanks,
Ben


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on Kudu

2016-06-20 Thread Benjamin Kim
Dan,

Out of curiosity, I was looking through the spark-csv code on GitHub and tried
to see what makes it work for the “CREATE TABLE” statement, while it doesn’t
work for spark-kudu. There are differences in the way both are done,
CsvRelation vs. KuduRelation. I’m still learning how this works, though, and
what the implications of these differences are. In your opinion, is this the
right place to start?
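
For reference, here is a minimal sketch of the Spark 1.x Data Source API entry
points involved. spark-csv's DefaultSource wires up relation-provider traits
like these, which is what both `CREATE TABLE ... USING` and `save()` resolve
against, so comparing which of them the Kudu DefaultSource implements (and what
its createRelation does when the table does not yet exist) does look like a
reasonable place to start. The relation below is a stub, not the real
CsvRelation or KuduRelation.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.sources._
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.{DataFrame, Row, SQLContext, SaveMode}

// Stub relation; a real implementation scans the underlying store.
class StubRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with TableScan {
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}

class DefaultSource extends RelationProvider
  with SchemaRelationProvider with CreatableRelationProvider {

  // CREATE TABLE t USING <pkg> OPTIONS (...)  -- no user-supplied schema
  override def createRelation(ctx: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new StubRelation(ctx, new StructType())

  // CREATE TABLE t (<columns>) USING <pkg> OPTIONS (...)
  override def createRelation(ctx: SQLContext,
      parameters: Map[String, String], schema: StructType): BaseRelation =
    new StubRelation(ctx, schema)

  // df.write.format(<pkg>).mode(...).save()  and  CREATE TABLE ... AS SELECT
  override def createRelation(ctx: SQLContext, mode: SaveMode,
      parameters: Map[String, String], data: DataFrame): BaseRelation =
    new StubRelation(ctx, data.schema)
}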

Thanks,
Ben


> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com> wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news 
> is that we are releasing regularly, and our next scheduled release is for the 
> August timeframe (see JD's roadmap 
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>  email about what we are aiming to include).  Also, Cloudera does publish 
> snapshot versions of the Spark connector here 
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
> jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase 

Data Integrity / Model Quality Monitoring

2016-06-17 Thread Benjamin Kim
Has anyone run into this requirement?

We have a need to track data integrity and model quality metrics of outcomes so 
that we can both gauge if the data is healthy coming in and the models run 
against them are still performing and not giving faulty results. A nice to have 
would be to graph these over time somehow. Since we are using Cloudera Manager, 
graphing in there would be a plus.

Any advice or suggestions would be welcome.

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark on Kudu

2016-06-17 Thread Benjamin Kim
Dan,

The roadmap is very informative. I am looking forward to the official 1.0 
release! It would be so much easier for us to use in every aspect compared to 
HBase.

Cheers,
Ben


> On Jun 17, 2016, at 11:08 AM, Dan Burkert <d...@cloudera.com> wrote:
> 
> Hi Ben,
> 
> To your first question about `CREATE TABLE` syntax with Kudu/Spark SQL, I do 
> not think we support that at this point.  I haven't looked deeply into it, 
> but we may hit issues specifying Kudu-specific options (partitioning, column 
> encoding, etc.).  Probably issues that can be worked through eventually, 
> though.  If you are interested in contributing to Kudu, this is an area that 
> could obviously use improvement!  Most or all of our Spark features have been 
> completely community driven to date.
>  
> I am assuming that more Spark support along with semantic changes below will 
> be incorporated into Kudu 0.9.1.
> 
> As a rule we do not release new features in patch releases, but the good news 
> is that we are releasing regularly, and our next scheduled release is for the 
> August timeframe (see JD's roadmap 
> <https://lists.apache.org/thread.html/1a3b949e715a74d7f26bd9c102247441a06d16d077324ba39a662e2a@1455234076@%3Cdev.kudu.apache.org%3E>
>  email about what we are aiming to include).  Also, Cloudera does publish 
> snapshot versions of the Spark connector here 
> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/>, so the 
> jars are available if you don't mind using snapshots.
>  
> Anyone know of a better way to make unique primary keys other than using UUID 
> to make every row unique if there is no unique column (or combination 
> thereof) to use.
> 
> Not that I know of.  In general it's pretty rare to have a dataset without a 
> natural primary key (even if it's just all of the columns), but in those 
> cases UUID is a good solution.
>  
> This is what I am using. I know auto incrementing is coming down the line 
> (don’t know when), but is there a way to simulate this in Kudu using Spark 
> out of curiosity?
> 
> To my knowledge there is no plan to have auto increment in Kudu.  
> Distributed, consistent, auto incrementing counters is a difficult problem, 
> and I don't think there are any known solutions that would be fast enough for 
> Kudu (happy to be proven wrong, though!).
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> I'm not sure exactly what the semantics will be, but at least one of them 
>> will be upsert.  These modes come from spark, and they were really designed 
>> for file-backed storage and not table storage.  We may want to do append = 
>> upsert, and overwrite = truncate + insert.  I think that may match the 
>> normal spark semantics more closely.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks for the information. That would mean both “append” and “overwrite” 
>> modes would be combined or not needed in the future.
>> 
>> Cheers,
>> Ben
>> 
>>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Right now append uses an update Kudu operation, which requires the row 
>>> already be present in the table. Overwrite maps to insert.  Kudu very 
>>> recently got upsert support baked in, but it hasn't yet been integrated 
>>> into the Spark connector.  So pretty soon these sharp edges will get a lot 
>>> better, since upsert is the way to go for most spark workloads.
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>>> data. Now, I have to find answers to these questions. What would happen if 
>>> I “append” to the data in the Kudu table if the data already exists? What 
>>> would happen if I “overwrite” existing data when the DataFrame has data in 
>>> it that does not exist in the Kudu table? I need to evaluate the best way 
>>> to simulate the UPSERT behavior in HBase because this is what our use case 
>>> is.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> 
>>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmai

Re: Spark on Kudu

2016-06-17 Thread Benjamin Kim
I am assuming that more Spark support along with semantic changes below will be 
incorporated into Kudu 0.9.1.

Anyone know of a better way to make unique primary keys other than using UUID 
to make every row unique if there is no unique column (or combination thereof) 
to use.

import java.util.UUID
val generateUUID = udf(() => UUID.randomUUID().toString)

This is what I am using. I know auto incrementing is coming down the line 
(don’t know when), but is there a way to simulate this in Kudu using Spark out 
of curiosity?
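
For reference, a minimal spark-shell sketch of applying the generated key before the table is created and written (the DataFrame name df and the key column name "my_id" are placeholder assumptions; the udf import is spelled out for completeness):

import java.util.UUID
import org.apache.spark.sql.functions.udf

// Hypothetical DataFrame df and key column "my_id": every row receives its own
// UUID string, which can then serve as the Kudu primary key column.
val generateUUID = udf(() => UUID.randomUUID().toString)
val dfWithKey = df.withColumn("my_id", generateUUID())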

Thanks,
Ben

> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote:
> 
> I'm not sure exactly what the semantics will be, but at least one of them 
> will be upsert.  These modes come from spark, and they were really designed 
> for file-backed storage and not table storage.  We may want to do append = 
> upsert, and overwrite = truncate + insert.  I think that may match the normal 
> spark semantics more closely.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks for the information. That would mean both “append” and “overwrite” 
> modes would be combined or not needed in the future.
> 
> Cheers,
> Ben
> 
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> Right now append uses an update Kudu operation, which requires the row 
>> already be present in the table. Overwrite maps to insert.  Kudu very 
>> recently got upsert support baked in, but it hasn't yet been integrated into 
>> the Spark connector.  So pretty soon these sharp edges will get a lot 
>> better, since upsert is the way to go for most spark workloads.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>> data. Now, I have to find answers to these questions. What would happen if I 
>> “append” to the data in the Kudu table if the data already exists? What 
>> would happen if I “overwrite” existing data when the DataFrame has data in 
>> it that does not exist in the Kudu table? I need to evaluate the best way to 
>> simulate the UPSERT behavior in HBase because this is what our use case is.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> Now, I’m getting this error when trying to write to the table.
>>> 
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq(“my_id")
>>> val key_list = List(“my_id”).asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>> 
>>> df.write
>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>> .mode("overwrite")
>>> .kudu
>>> 
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>> (error 0)Not found: key not found (error 0)
>>> 
>>> Does the key field need to be first in the DataFrame?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com 
>>>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>> 
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>>> 
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>> using setRangePartitionColumns or addHashPartitions
>>>> 
>>>> Yep.  The `Seq("my_id")` part of that call is sp

Re: Ask opinion regarding 0.6.0 release package

2016-06-17 Thread Benjamin Kim
Hi,

Our company uses Spark, Phoenix, and JDBC/psql. So, if you make different 
packages, I would need the full one. In addition, for the minimized one, would 
there be a way to pick and choose interpreters to add/plug in?

Thanks,
Ben

> On Jun 17, 2016, at 1:02 AM, mina lee  wrote:
> 
> Hi all!
> 
> Zeppelin just started release process. Prior to creating release candidate I 
> want to ask users' opinion about how you want it to be packaged.
> 
> For the last release(0.5.6), we have released one binary package which 
> includes all interpreters.
> The concern with providing one type of binary package is that package size 
> will be quite big(~600MB).
> So I am planning to provide two binary packages:
>   - zeppelin-0.6.0-bin-all.tgz (includes all interpreters)
>   - zeppelin-0.6.0-bin-min.tgz (includes only most used interpreters)
> 
> I am thinking about putting spark(pyspark, sparkr, sql), python, jdbc, shell, 
> markdown, angular in minimized package.
> Could you give your opinion on whether these sets are enough, or some of them 
> are ok to be excluded?
> 
> Community's opinion will be helpful to make decision not only for 0.6.0 but 
> also for 0.7.0 release since we are planning to provide only minimized 
> package from 0.7.0 release. From the 0.7.0 version, interpreters those are 
> not included in binary package will be able to use dynamic interpreter 
> feature [1] which is in progress under [2].
> 
> Thanks,
> Mina
> 
> [1] 
> http://zeppelin.apache.org/docs/0.6.0-SNAPSHOT/manual/dynamicinterpreterload.html
>  
> 
> [2] https://github.com/apache/zeppelin/pull/908 
> 


Re: Spark on Kudu

2016-06-15 Thread Benjamin Kim
Since I have created permanent tables using org.apache.spark.sql.jdbc and 
com.databricks.spark.csv with sqlContext, I was wondering if I can do the same 
with Kudu tables?

CREATE TABLE 
USING org.kududb.spark.kudu
OPTIONS ("kudu.master" "kudu_master", "kudu.table" "kudu_tablename")

Is this possible? By the way, the above didn’t work for me.
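
For reference, a hedged sketch of the read-then-register pattern from the kudu-spark documentation linked in this thread, which gives SQL access without a persistent CREATE TABLE (kuduMaster and tableName are placeholders, and the 0.9-era org.kududb package is assumed):

import org.kududb.spark.kudu._

// Read the Kudu table into a DataFrame, then expose it to Spark SQL as a
// session-scoped temporary table.
val kuduDF = sqlContext.read
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .kudu
kuduDF.registerTempTable("kudu_tablename")
sqlContext.sql("SELECT COUNT(*) FROM kudu_tablename").show()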

Thanks,
Ben

> On Jun 14, 2016, at 6:08 PM, Dan Burkert <d...@cloudera.com> wrote:
> 
> I'm not sure exactly what the semantics will be, but at least one of them 
> will be upsert.  These modes come from spark, and they were really designed 
> for file-backed storage and not table storage.  We may want to do append = 
> upsert, and overwrite = truncate + insert.  I think that may match the normal 
> spark semantics more closely.
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 6:00 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks for the information. That would mean both “append” and “overwrite” 
> modes would be combined or not needed in the future.
> 
> Cheers,
> Ben
> 
>> On Jun 14, 2016, at 5:57 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> Right now append uses an update Kudu operation, which requires the row 
>> already be present in the table. Overwrite maps to insert.  Kudu very 
>> recently got upsert support baked in, but it hasn't yet been integrated into 
>> the Spark connector.  So pretty soon these sharp edges will get a lot 
>> better, since upsert is the way to go for most spark workloads.
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 5:41 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I tried to use the “append” mode, and it worked. Over 3.8 million rows in 
>> 64s. I would assume that now I can use the “overwrite” mode on existing 
>> data. Now, I have to find answers to these questions. What would happen if I 
>> “append” to the data in the Kudu table if the data already exists? What 
>> would happen if I “overwrite” existing data when the DataFrame has data in 
>> it that does not exist in the Kudu table? I need to evaluate the best way to 
>> simulate the UPSERT behavior in HBase because this is what our use case is.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>>> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> Hi,
>>> 
>>> Now, I’m getting this error when trying to write to the table.
>>> 
>>> import scala.collection.JavaConverters._
>>> val key_seq = Seq(“my_id")
>>> val key_list = List(“my_id”).asJava
>>> kuduContext.createTable(tableName, df.schema, key_seq, new 
>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
>>> 
>>> df.write
>>> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
>>> .mode("overwrite")
>>> .kudu
>>> 
>>> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to 
>>> Kudu; sample errors: Not found: key not found (error 0)Not found: key not 
>>> found (error 0)Not found: key not found (error 0)Not found: key not found 
>>> (error 0)Not found: key not found (error 0)
>>> 
>>> Does the key field need to be first in the DataFrame?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com 
>>>> <mailto:d...@cloudera.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Dan,
>>>> 
>>>> Thanks! It got further. Now, how do I set the Primary Key to be a 
>>>> column(s) in the DataFrame and set the partitioning? Is it like this?
>>>> 
>>>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>>>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>>>> 
>>>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>>>> using setRangePartitionColumns or addHashPartitions
>>>> 
>>>> Yep.  The `Seq("my_id")` part of that call is specifying the set of 
>>>> primary key columns, so in this case you have specified the single PK 
>>>> column "my_id".  The `addHas

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Todd,

I think data locality does not apply to our setup anyway. We have the compute cluster with 
Spark, YARN, etc. on its own, and we have the storage cluster with HBase, Kudu, 
etc. on another. We beefed up the hardware specs on the compute cluster and 
beefed up storage capacity on the storage cluster. We got this setup idea from 
the Databricks folks. I do have a question. I created the table to use range 
partitioning on columns. I see that with hash partitioning I can set the number 
of splits, but how do I do that using range (50 nodes * 10 = 500 splits)?
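
For reference, a hedged sketch of the sizing with hash partitioning, where the tablet count is simply the bucket count (0.9-era API as used elsewhere in this thread; the column name "my_id" and the replication factor of 3 are assumptions). Range partitioning appears to require setRangePartitionColumns plus explicit split rows, which is harder to drive from the Spark-side createTable call:

import scala.collection.JavaConverters._
import org.kududb.client._

// Roughly 10 tablets per node on a 50-node cluster: 500 hash buckets.
val options = new CreateTableOptions()
  .setNumReplicas(3)
  .addHashPartitions(List("my_id").asJava, 500)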

Thanks,
Ben

> On Jun 15, 2016, at 9:11 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Awesome use case. One thing to keep in mind is that spark parallelism will be 
> limited by the number of tablets. So, you might want to split into 10 or so 
> buckets per node to get the best query throughput.
> 
> Usually if you run top on some machines while running the query you can see 
> if it is fully utilizing the cores.
> 
> Another known issue right now is that spark locality isn't working properly 
> on replicated tables so you will use a lot of network traffic. For a perf 
> test you might want to try a table with replication count 1
> 
> On Jun 15, 2016 5:26 PM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> I did a simple test of our ad events. We stream using Spark Streaming 
> directly into HBase, and the Data Analysts/Scientists do some 
> insight/discovery work plus some reports generation. For the reports, we use 
> SQL, and the more deeper stuff, we use Spark. In Spark, our main data 
> currency store of choice is DataFrames.
> 
> The schema is around 83 columns wide where most are of the string data type.
> 
> "event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
> "user_id", "mappable_id",
> "cookie_status", "profile_status", "user_status", "previous_timestamp", 
> "user_agent", "referer",
> "host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", 
> "creative_id",
> "location_id", “pcamp_id",
> "pdomain_id", "continent_code", "country", "region", "dma", "city", "zip", 
> "isp", "line_speed",
> "gender", "year_of_birth", "behaviors_read", "behaviors_written", 
> "key_value_pairs", "acamp_candidates",
> "tag_format", "optimizer_name", "optimizer_version", "optimizer_ip", 
> "pixel_id", “video_id",
> "video_network_id", "video_time_watched", "video_percentage_watched", 
> "video_media_type",
> "video_player_iframed", "video_player_in_view", "video_player_width", 
> "video_player_height",
> "conversion_valid_sale", "conversion_sale_amount", 
> "conversion_commission_amount", "conversion_step",
> "conversion_currency", "conversion_attribution", "conversion_offer_id", 
> "custom_info", "frequency",
> "recency_seconds", "cost", "revenue", “optimizer_acamp_id",
> "optimizer_creative_id", "optimizer_ecpm", "impression_id", "diagnostic_data",
> "user_profile_mapping_source", "latitude", "longitude", "area_code", 
> "gmt_offset", "in_dst",
> "proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", 
> "timestamp_iso", "reference_id",
> "identity_organization", "identity_method"
> 
> Most queries are like counts of how many users use what browser, how many are 
> unique users, etc. The part that scares most users is when it comes to 
> joining this data with other dimension/3rd party events tables because of 
> shear size of it.
> 
> We do what most companies do, similar to what I saw in earlier presentations 
> of Kudu. We dump data out of HBase into partitioned Parquet tables to make 
> query performance manageable.
> 
> I will coordinate with a data scientist today to do some tests. He is working 
> on identity matching/record linking of users from 2 domains: US and 
> Singapore, using probabilistic deduping algorithms. I will load the data from 
> ad events from both countries, and let him run his process against this data 
> in Kudu. I hope this will “w

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Hi Todd,

I did a simple test of our ad events. We stream using Spark Streaming directly 
into HBase, and the Data Analysts/Scientists do some insight/discovery work 
plus some report generation. For the reports, we use SQL, and for the deeper 
analysis, we use Spark. In Spark, our main data currency of choice is 
DataFrames.

The schema is around 83 columns wide where most are of the string data type.

"event_type", "timestamp", "event_valid", "event_subtype", "user_ip", 
"user_id", "mappable_id",
"cookie_status", "profile_status", "user_status", "previous_timestamp", 
"user_agent", "referer",
"host_domain", "uri", "request_elapsed", "browser_languages", "acamp_id", 
"creative_id",
"location_id", “pcamp_id",
"pdomain_id", "continent_code", "country", "region", "dma", "city", "zip", 
"isp", "line_speed",
"gender", "year_of_birth", "behaviors_read", "behaviors_written", 
"key_value_pairs", "acamp_candidates",
"tag_format", "optimizer_name", "optimizer_version", "optimizer_ip", 
"pixel_id", “video_id",
"video_network_id", "video_time_watched", "video_percentage_watched", 
"video_media_type",
"video_player_iframed", "video_player_in_view", "video_player_width", 
"video_player_height",
"conversion_valid_sale", "conversion_sale_amount", 
"conversion_commission_amount", "conversion_step",
"conversion_currency", "conversion_attribution", "conversion_offer_id", 
"custom_info", "frequency",
"recency_seconds", "cost", "revenue", “optimizer_acamp_id",
"optimizer_creative_id", "optimizer_ecpm", "impression_id", "diagnostic_data",
"user_profile_mapping_source", "latitude", "longitude", "area_code", 
"gmt_offset", "in_dst",
"proxy_type", "mobile_carrier", "pop", "hostname", "profile_expires", 
"timestamp_iso", "reference_id",
"identity_organization", "identity_method"

Most queries are like counts of how many users use what browser, how many are 
unique users, etc. The part that scares most users is when it comes to joining 
this data with other dimension/3rd party events tables because of the sheer size 
of it.

We do what most companies do, similar to what I saw in earlier presentations of 
Kudu. We dump data out of HBase into partitioned Parquet tables to make query 
performance manageable.

I will coordinate with a data scientist today to do some tests. He is working 
on identity matching/record linking of users from 2 domains: US and Singapore, 
using probabilistic deduping algorithms. I will load the data from ad events 
from both countries, and let him run his process against this data in Kudu. I 
hope this will “wow” the team.

Thanks,
Ben

> On Jun 15, 2016, at 12:47 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> Hi Benjamin,
> 
> What workload are you using for benchmarks? Using spark or something more 
> custom? rdd or data frame or SQL, etc? Maybe you can share the schema and 
> some queries
> 
> Todd
> 
> Todd
> 
> On Jun 15, 2016 8:10 AM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Todd,
> 
> Now that Kudu 0.9.0 is out. I have done some tests. Already, I am impressed. 
> Compared to HBase, read and write performance are better. Write performance 
> has the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are 
> only preliminary tests. Do you know of a way to really do some conclusive 
> tests? I want to see if I can match your results on my 50 node cluster.
> 
> Thanks,
> Ben
> 
>> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Todd,
>> 
>> It sounds like Kudu can possibly top or match those numbers put out by 
>> Aerospike. Do you have any performance statistics published or any 
>> instructions as to measure them myself as good way to test? In addition, 
>> this will be a test using Spark, so should I wait for Kudu version 0.9.0 
>> where support will be built in?
>> 
>> We don't hav

Re: Performance Question

2016-06-15 Thread Benjamin Kim
Hi Todd,

Now that Kudu 0.9.0 is out, I have done some tests. Already, I am impressed. 
Compared to HBase, read and write performance are better. Write performance has 
the greatest improvement (> 4x), while read is > 1.5x. Albeit, these are only 
preliminary tests. Do you know of a way to really do some conclusive tests? I 
want to see if I can match your results on my 50 node cluster.

Thanks,
Ben

> On May 30, 2016, at 10:33 AM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Sat, May 28, 2016 at 7:12 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Todd,
> 
> It sounds like Kudu can possibly top or match those numbers put out by 
> Aerospike. Do you have any performance statistics published or any 
> instructions as to measure them myself as good way to test? In addition, this 
> will be a test using Spark, so should I wait for Kudu version 0.9.0 where 
> support will be built in?
> 
> We don't have a lot of benchmarks published yet, especially on the write 
> side. I've found that thorough cross-system benchmarks are very difficult to 
> do fairly and accurately, and often times users end up misguided if they pay 
> too much attention to them :) So, given a finite number of developers working 
> on Kudu, I think we've tended to spend more time on the project itself and 
> less time focusing on "competition". I'm sure there are use cases where Kudu 
> will beat out Aerospike, and probably use cases where Aerospike will beat 
> Kudu as well.
> 
> From my perspective, it would be great if you can share some details of your 
> workload, especially if there are some areas you're finding Kudu lacking. 
> Maybe we can spot some easy code changes we could make to improve 
> performance, or suggest a tuning variable you could change.
> 
> -Todd
> 
> 
>> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com 
>> <mailto:t...@cloudera.com>> wrote:
>> 
>> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi Mike,
>> 
>> First of all, thanks for the link. It looks like an interesting read. I 
>> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
>> they are evaluating version 3.5.4. The main thing that impressed me was 
>> their claim that they can beat Cassandra and HBase by 8x for writing and 25x 
>> for reading. Their big claim to fame is that Aerospike can write 1M records 
>> per second with only 50 nodes. I wanted to see if this is real.
>> 
>> 1M records per second on 50 nodes is pretty doable by Kudu as well, 
>> depending on the size of your records and the insertion order. I've been 
>> playing with a ~70 node cluster recently and seen 1M+ writes/second 
>> sustained, and bursting above 4M. These are 1KB rows with 11 columns, and 
>> with pretty old HDD-only nodes. I think newer flash-based nodes could do 
>> better.
>>  
>> 
>> To answer your questions, we have a DMP with user profiles with many 
>> attributes. We create segmentation information off of these attributes to 
>> classify them. Then, we can target advertising appropriately for our sales 
>> department. Much of the data processing is for applying models on all or if 
>> not most of every profile’s attributes to find similarities (nearest 
>> neighbor/clustering) over a large number of rows when batch processing or a 
>> small subset of rows for quick online scoring. So, our use case is a typical 
>> advanced analytics scenario. We have tried HBase, but it doesn’t work well 
>> for these types of analytics.
>> 
>> I read, that Aerospike in the release notes, they did do many improvements 
>> for batch and scan operations.
>> 
>> I wonder what your thoughts are for using Kudu for this.
>> 
>> Sounds like a good Kudu use case to me. I've heard great things about 
>> Aerospike for the low latency random access portion, but I've also heard 
>> that it's _very_ expensive, and not particularly suited to the columnar scan 
>> workload. Lastly, I think the Apache license of Kudu is much more appealing 
>> than the AGPL3 used by Aerospike. But, that's not really a direct answer to 
>> the performance question :)
>>  
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com 
>>> <mailto:mpe...@cloudera.com>> wrote:
>>> 
>>> Have you considered whether you have a scan heavy or a random access heavy 
>>> workload? Have you considered whether you always access / update a whole 
>>> row vs only a partial r

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim
I tried to use the “append” mode, and it worked. Over 3.8 million rows in 64s. 
I would assume that now I can use the “overwrite” mode on existing data. Now, I 
have to find answers to these questions. What would happen if I “append” to the 
data in the Kudu table if the data already exists? What would happen if I 
“overwrite” existing data when the DataFrame has data in it that does not exist 
in the Kudu table? I need to evaluate the best way to simulate the UPSERT 
behavior in HBase because this is what our use case is.

Thanks,
Ben


> On Jun 14, 2016, at 5:05 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> Hi,
> 
> Now, I’m getting this error when trying to write to the table.
> 
> import scala.collection.JavaConverters._
> val key_seq = Seq(“my_id")
> val key_list = List(“my_id”).asJava
> kuduContext.createTable(tableName, df.schema, key_seq, new 
> CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))
> 
> df.write
> .options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
> .mode("overwrite")
> .kudu
> 
> java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; 
> sample errors: Not found: key not found (error 0)Not found: key not found 
> (error 0)Not found: key not found (error 0)Not found: key not found (error 
> 0)Not found: key not found (error 0)
> 
> Does the key field need to be first in the DataFrame?
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> 
>> 
>> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Dan,
>> 
>> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) 
>> in the DataFrame and set the partitioning? Is it like this?
>> 
>> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
>> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
>> 
>> java.lang.IllegalArgumentException: Table partitioning must be specified 
>> using setRangePartitionColumns or addHashPartitions
>> 
>> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
>> key columns, so in this case you have specified the single PK column 
>> "my_id".  The `addHashPartitions` call adds hash partitioning to the table, 
>> in this case over the column "my_id" (which is good, it must be over one or 
>> more PK columns, so in this case "my_id" is the one and only valid 
>> combination).  However, the call to `addHashPartition` also takes the number 
>> of buckets as the second param.  You shouldn't get the 
>> IllegalArgumentException as long as you are specifying either 
>> `addHashPartitions` or `setRangePartitionColumns`.
>> 
>> - Dan
>>  
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com 
>>> <mailto:d...@cloudera.com>> wrote:
>>> 
>>> Looks like we're missing an import statement in that example.  Could you 
>>> try:
>>> 
>>> import org.kududb.client._
>>> and try again?
>>> 
>>> - Dan
>>> 
>>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> I encountered an error trying to create a table based on the documentation 
>>> from a DataFrame.
>>> 
>>> :49: error: not found: type CreateTableOptions
>>>   kuduContext.createTable(tableName, df.schema, Seq("key"), new 
>>> CreateTableOptions().setNumReplicas(1))
>>> 
>>> Is there something I’m missing?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> It's only in Cloudera's maven repo: 
>>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>>  
>>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>>>> 
>>>> J-D
>>>> 
>>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Hi J-D,
>>>> 
>>>> I installed Kudu 0.9.0 using CM, but I can’t find t

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim
Hi,

Now, I’m getting this error when trying to write to the table.

import scala.collection.JavaConverters._
import org.kududb.client._

val key_seq = Seq("my_id")
val key_list = List("my_id").asJava
kuduContext.createTable(tableName, df.schema, key_seq, new 
CreateTableOptions().setNumReplicas(1).addHashPartitions(key_list, 100))

df.write
.options(Map("kudu.master" -> kuduMaster,"kudu.table" -> tableName))
.mode("overwrite")
.kudu

java.lang.RuntimeException: failed to write 1000 rows from DataFrame to Kudu; 
sample errors: Not found: key not found (error 0)Not found: key not found 
(error 0)Not found: key not found (error 0)Not found: key not found (error 
0)Not found: key not found (error 0)

Does the key field need to be first in the DataFrame?
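
For reference, the follow-up message in this thread reports that the same write succeeds with the save mode switched to "append"; a sketch of that variant with the same placeholder kuduMaster and tableName (a later reply also notes that in this release "append" and "overwrite" map to different Kudu operations rather than a true upsert):

// Identical write, with mode("append") instead of mode("overwrite").
df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> tableName))
  .mode("append")
  .kudu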

Thanks,
Ben

> On Jun 14, 2016, at 4:28 PM, Dan Burkert <d...@cloudera.com> wrote:
> 
> 
> 
> On Tue, Jun 14, 2016 at 4:20 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Dan,
> 
> Thanks! It got further. Now, how do I set the Primary Key to be a column(s) 
> in the DataFrame and set the partitioning? Is it like this?
> 
> kuduContext.createTable(tableName, df.schema, Seq(“my_id"), new 
> CreateTableOptions().setNumReplicas(1).addHashPartitions(“my_id"))
> 
> java.lang.IllegalArgumentException: Table partitioning must be specified 
> using setRangePartitionColumns or addHashPartitions
> 
> Yep.  The `Seq("my_id")` part of that call is specifying the set of primary 
> key columns, so in this case you have specified the single PK column "my_id". 
>  The `addHashPartitions` call adds hash partitioning to the table, in this 
> case over the column "my_id" (which is good, it must be over one or more PK 
> columns, so in this case "my_id" is the one and only valid combination).  
> However, the call to `addHashPartition` also takes the number of buckets as 
> the second param.  You shouldn't get the IllegalArgumentException as long as 
> you are specifying either `addHashPartitions` or `setRangePartitionColumns`.
> 
> - Dan
>  
> 
> Thanks,
> Ben
> 
> 
>> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com 
>> <mailto:d...@cloudera.com>> wrote:
>> 
>> Looks like we're missing an import statement in that example.  Could you try:
>> 
>> import org.kududb.client._
>> and try again?
>> 
>> - Dan
>> 
>> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I encountered an error trying to create a table based on the documentation 
>> from a DataFrame.
>> 
>> :49: error: not found: type CreateTableOptions
>>   kuduContext.createTable(tableName, df.schema, Seq("key"), new 
>> CreateTableOptions().setNumReplicas(1))
>> 
>> Is there something I’m missing?
>> 
>> Thanks,
>> Ben
>> 
>>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> It's only in Cloudera's maven repo: 
>>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>>  
>>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>>> 
>>> J-D
>>> 
>>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Hi J-D,
>>> 
>>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>>> spark-shell to use. Can you show me where to find it?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> What's in this doc is what's gonna get released: 
>>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>>  
>>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>>>> 
>>>> J-D
>>>> 
>>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Will this be documented with examples once 0.9.0 comes out?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>

Re: Spark on Kudu

2016-06-14 Thread Benjamin Kim
Dan,

Thanks! It got further. Now, how do I set the Primary Key to be a column(s) in 
the DataFrame and set the partitioning? Is it like this?

kuduContext.createTable(tableName, df.schema, Seq("my_id"), new 
CreateTableOptions().setNumReplicas(1).addHashPartitions("my_id"))

java.lang.IllegalArgumentException: Table partitioning must be specified using 
setRangePartitionColumns or addHashPartitions
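
For reference, a sketch of the corrected call per Dan's follow-up elsewhere in this thread: the hash-partition columns are passed as a Java list, and addHashPartitions takes the bucket count as its second argument (the 100 buckets here are an arbitrary illustration):

import scala.collection.JavaConverters._
import org.kududb.client._

kuduContext.createTable(tableName, df.schema, Seq("my_id"),
  new CreateTableOptions()
    .setNumReplicas(1)
    .addHashPartitions(List("my_id").asJava, 100))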

Thanks,
Ben


> On Jun 14, 2016, at 4:07 PM, Dan Burkert <d...@cloudera.com> wrote:
> 
> Looks like we're missing an import statement in that example.  Could you try:
> 
> import org.kududb.client._
> and try again?
> 
> - Dan
> 
> On Tue, Jun 14, 2016 at 4:01 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I encountered an error trying to create a table based on the documentation 
> from a DataFrame.
> 
> :49: error: not found: type CreateTableOptions
>   kuduContext.createTable(tableName, df.schema, Seq("key"), new 
> CreateTableOptions().setNumReplicas(1))
> 
> Is there something I’m missing?
> 
> Thanks,
> Ben
> 
>> On Jun 14, 2016, at 3:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> 
>> It's only in Cloudera's maven repo: 
>> https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/
>>  
>> <https://repository.cloudera.com/cloudera/cloudera-repos/org/kududb/kudu-spark_2.10/0.9.0/>
>> 
>> J-D
>> 
>> On Tue, Jun 14, 2016 at 2:59 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi J-D,
>> 
>> I installed Kudu 0.9.0 using CM, but I can’t find the kudu-spark jar for 
>> spark-shell to use. Can you show me where to find it?
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Jun 8, 2016, at 1:19 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> What's in this doc is what's gonna get released: 
>>> https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark
>>>  
>>> <https://github.com/cloudera/kudu/blob/master/docs/developing.adoc#kudu-integration-with-spark>
>>> 
>>> J-D
>>> 
>>> On Tue, Jun 7, 2016 at 8:52 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Will this be documented with examples once 0.9.0 comes out?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On May 28, 2016, at 3:22 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>>> <mailto:jdcry...@apache.org>> wrote:
>>>> 
>>>> It will be in 0.9.0.
>>>> 
>>>> J-D
>>>> 
>>>> On Sat, May 28, 2016 at 8:31 AM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Hi Chris,
>>>> 
>>>> Will all this effort be rolled into 0.9.0 and be ready for use?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>>> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com 
>>>>> <mailto:christopher.geo...@rms.com>> wrote:
>>>>> 
>>>>> There is some code in review that needs some more refinement.
>>>>> It will allow upsert/insert from a dataframe using the datasource api. It 
>>>>> will also allow the creation and deletion of tables from a dataframe
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/>
>>>>> 
>>>>> Example usages will look something like:
>>>>> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
>>>>> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
>>>>> 
>>>>> -Chris George
>>>>> 
>>>>> 
>>>>> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com 
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> 
>>>>> Can someone tell me what the state is of this Spark work?
>>>>> 
>>>>> Also, does anyone have any sample code on how to update/insert data in 
>>>>> Kudu using DataFrames?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>>> On Apr 13, 2016, at 8:22 AM, Chris George <chris

Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-14 Thread Benjamin Kim
Hi J-D,

I would like to get this started especially now that UPSERT and Spark SQL 
DataFrames support. But, how do I use Cloudera Manager to deploy it? Is there a 
parcel available yet? Is there a new CSD file to be downloaded? I currently 
have CM 5.7.0 installed.

Thanks,
Ben



> On Jun 10, 2016, at 7:39 AM, Jean-Daniel Cryans  wrote:
> 
> The Apache Kudu (incubating) team is happy to announce the release of Kudu 
> 0.9.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds basic UPSERT functionality and an improved Apache 
> Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also 
> improves Tablet Server restart time as well as write performance under high 
> load. Finally, Kudu now enforces the specification of a partitioning scheme 
> for new tables.
> 
> Download it here: http://getkudu.io/releases/0.9.0/ 
> 
> 
> Regards,
> 
> The Apache Kudu (incubating) team
> 
> ===
> 
> Apache Kudu (incubating) is an effort undergoing incubation at The Apache 
> Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making
> process have stabilized in a manner consistent with other successful
> ASF projects. While incubation status is not necessarily a reflection
> of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.



Re: [ANNOUNCE] Apache Kudu (incubating) 0.9.0 released

2016-06-13 Thread Benjamin Kim
Hi J-D,

I would like to get this started especially now that UPSERT and Spark SQL 
DataFrames support. But, how do I use Cloudera Manager to deploy it? Is there a 
parcel available yet? Is there a new CSD file to be downloaded? I currently 
have CM 5.7.0 installed.

Thanks,
Ben



> On Jun 10, 2016, at 7:39 AM, Jean-Daniel Cryans  wrote:
> 
> The Apache Kudu (incubating) team is happy to announce the release of Kudu 
> 0.9.0!
> 
> Kudu is an open source storage engine for structured data which supports 
> low-latency random access together with efficient analytical access patterns. 
> It is designed within the context of the Apache Hadoop ecosystem and supports 
> many integrations with other data analytics projects both inside and outside 
> of the Apache Software Foundation.
> 
> This latest version adds basic UPSERT functionality and an improved Apache 
> Spark Data Source that doesn’t rely on the MapReduce I/O formats. It also 
> improves Tablet Server restart time as well as write performance under high 
> load. Finally, Kudu now enforces the specification of a partitioning scheme 
> for new tables.
> 
> Download it here: http://getkudu.io/releases/0.9.0/ 
> 
> 
> Regards,
> 
> The Apache Kudu (incubating) team
> 
> ===
> 
> Apache Kudu (incubating) is an effort undergoing incubation at The Apache 
> Software
> Foundation (ASF), sponsored by the Apache Incubator PMC. Incubation is
> required of all newly accepted projects until a further review
> indicates that the infrastructure, communications, and decision making
> process have stabilized in a manner consistent with other successful
> ASF projects. While incubation status is not necessarily a reflection
> of the completeness or stability of the code, it does indicate that
> the project has yet to be fully endorsed by the ASF.



Re: phoenix on non-apache hbase

2016-06-09 Thread Benjamin Kim
This interests me too. I asked Cloudera in their community forums a while back 
but got no answer on this. I hope they don’t leave us out in the cold. I tried 
building it too before with the instructions here 
https://issues.apache.org/jira/browse/PHOENIX-2834. I could get it to build, 
but I couldn’t get it to work using the Phoenix installation instructions. For 
some reason, dropping the server jar into the CDH 5.7.0 HBase lib directory didn’t 
change things. HBase seemed not to use it. Now that this is out, I’ll give it 
another try hoping that there is a way. If anyone has any leads to help, please 
let me know.

Thanks,
Ben


> On Jun 9, 2016, at 6:39 PM, Josh Elser  wrote:
> 
> Koert,
> 
> Apache Phoenix goes through a lot of work to provide multiple versions of 
> Phoenix for various versions of Apache HBase (0.98, 1.1, and 1.2 presently). 
> The builds for each of these branches are tested against those specific 
> versions of HBase, so I doubt that there are issues between Apache Phoenix 
> and the corresponding version of Apache HBase.
> 
> In general, I believe older versions of Phoenix clients can work against 
> newer versions of Phoenix running in HBase; but, of course, you'd be much 
> better off using equivalent versions on both client and server.
> 
> If you are having issues running Apache Phoenix over vendor-creations of 
> HBase, I would encourage you to reach out on said-vendor's support channels.
> 
> - Josh
> 
> Koert Kuipers wrote:
>> hello all,
>> 
>> i decided i wanted to give phoenix a try on our cdh 5.7.0 cluster. so i
>> download phoenix, see that the master is already for hbase 1.2.0, change
>> the hbase version to 1.2.0-cdh5.7.0, and tell maven to run tests make
>> the package, expecting not much trouble.
>> 
>> but i was wrong... plenty of compilation errors, and some serious
>> incompatibilities (tetra?).
>> 
>> yikes. what happened? why is it so hard to compile for a distro's hbase?
>> i do this all the time for vendor-specific hadoop versions without
>> issues. is cloudera's hbase 1.2.0-cdh5.7.0 that different from apache
>> hbase 1.2.0?
>> 
>> assuming i get the phoenix-server working for hbase 1.2.0-cdh5.7.0, how
>> sensitive is the phoenix-client to the hbase version? can i at least
>> assume all the pain is in the phoenix-server and i can ship a generic
>> phoenix-client with my software that works on all clusters with the same
>> phoenix-server version installed?
>> 
>> thanks! best, koert



Github Integration

2016-06-09 Thread Benjamin Kim
I heard that Zeppelin 0.6.0 is able to use its local notebook directory as a 
Github repo. Does anyone know of a way to have it work (workaround) with our 
company’s Github (Stash) repo server?

Any advice would be welcome.

Thanks,
Ben

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
> event_type                      string
> timestamp                       string
> event_valid                     int
> event_subtype                   string
> user_ip                         string
> user_id                         string
> cookie_status                   string
> profile_status                  string
> user_status                     string
> previous_timestamp              string
> user_agent                      string
> referer                         string
> uri                             string
> request_elapsed                 bigint
> browser_languages               string
> acamp_id                        int
> creative_id                     int
> location_id                     int
> pcamp_id                        int
> pdomain_id                      int
> country                         string
> region                          string
> dma                             int
> city                            string
> zip                             string
> isp                             string
> line_speed                      string
> gender                          string
> year_of_birth                   int
> behaviors_read                  string
> behaviors_written               string
> key_value_pairs                 string
> acamp_candidates                int
> tag_format                      string
> optimizer_name                  string
> optimizer_version               string
> optimizer_ip                    string
> pixel_id                        int
> video_id                        string
> video_network_id                int
> video_time_watched              bigint
> video_percentage_watched        int
> conversion_valid_sale           int
> conversion_sale_amount          float
> conversion_commission_amount    float
> conversion_step                 int
> conversion_currency             string
> conversion_attribution          int
> conversion_offer_id             string
> custom_info                     string
> frequency                       int
> recency_seconds                 int
> cost                            float
> revenue                         float
> optimizer_acamp_id              int
> optimizer_creative_id           int
> optimizer_ecpm                  float
> event_id                        string
> impression_id                   string
> diagnostic_data                 string
> user_profile_mapping_source     string
> latitude                        float
> longitude                       float
> area_code                       int
> gmt_offset                      string
> in_dst                          string
> proxy_type                      string
> mobile_carrier                  string
> pop                             string
> hostname                        string
> profile_ttl                     string
> timestamp_iso                   string
> reference_id                    string
> identity_organization           string
> identity_method                 string
> mappable_id                     string
> profile_expires                 string
> video_player_iframed            int
> video_player_in_view            int
> video_player_width              int
> video_player_height             int
> host_domain                     string
> browser_type                    string
> browser_device_cat              string
> browser_family                  string
> browser_name                    string
> browser_version                 string
> browser_major_version           string
> browser_minor_version           string
> os_family                       string
> os_name                         string
> os_version                      string
> os_major_version                string
> os_minor_version                string
> # Partition Information
> # col_name                      data_type    comment
> dt                              timestamp
> # Detailed Table Information
> Database:                       test
> Owner:                          hduser
> CreateTime:                     Fri Jun 03 19:03:20 BST 2016
> LastAccessTime:                 UNKNOWN
> Retention:                      0
> Location:                       hdfs://rhes564:9000/user/hive/warehouse/test.db/amo_bi_events
> Table Type:                     EXTERNAL_TABLE
> Table Parameters:
> EXTERNAL                        TRUE
> transient_lastDdlTime           1464977000
> # Storage Information
> SerDe Library:                  org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat:                    org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat:                   org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> Compressed:                     No
> Num Buckets:                    -1
> Bucket Columns:                 []
> Sort Columns:                   []
> Storage Desc Params:
> serialization.format            1
> Time taken: 0.397 seconds, Fetched: 124 row(s)
> 
> So effectively that table is partitioned by dt in notime
> 
> Now what I don't understand whether that table is already partitioned as you 
> said the table already exists!
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://t

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
Mich,

I am using .withColumn to add another column “dt” that is a reformatted version 
of an existing column “timestamp”. The partition column is “dt”.

We are using Spark 1.6.0 in CDH 5.7.0.

Thanks,
Ben

> On Jun 3, 2016, at 10:33 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> what version of spark are you using
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 June 2016 at 17:51, Mich Talebzadeh <mich.talebza...@gmail.com 
> <mailto:mich.talebza...@gmail.com>> wrote:
> ok what is the new column is called? you are basically adding a new column to 
> an already existing table
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 June 2016 at 17:04, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> The table already exists.
> 
>  CREATE EXTERNAL TABLE `amo_bi_events`(   
>`event_type` string COMMENT '',
>   
>`timestamp` string COMMENT '', 
>   
>`event_valid` int COMMENT '',  
>   
>`event_subtype` string COMMENT '', 
>   
>`user_ip` string COMMENT '',   
>   
>`user_id` string COMMENT '',   
>   
>`cookie_status` string COMMENT '', 
>   
>`profile_status` string COMMENT '',
>   
>`user_status` string COMMENT '',   
>   
>`previous_timestamp` string COMMENT '',
>   
>`user_agent` string COMMENT '',
>   
>`referer` string COMMENT '',   
>   
>`uri` string COMMENT '',   
>   
>`request_elapsed` bigint COMMENT '',   
>   
>`browser_languages` string COMMENT '', 
>   
>`acamp_id` int COMMENT '', 
>   
>`creative_id` int COMMENT '',  
>   
>`location_id` int COMMENT '',  
>   
>`pcamp_id` int COMMENT '', 
>   
>`pdomain_id` int COMMENT '',   
>   
>`country` string COMMENT '',   
>   
>`region` string COMMENT '',
>   
>`dma` int COMMENT '',  
>   
>`city` string COMMENT '',  
>   
>`zip` string COMMENT '',   
>   
>`isp` string COMMENT '',   
>   
>`line_speed` string COMMENT '',
>   
>`gender` string COMMENT '',
>   
>`year_of_birth` int COMMENT '',
>   
>`behaviors_read` string COMMENT '',
>   
>`behaviors_written` string COMMENT '', 
>   
>`key_value_pairs` string COMMENT '',   
>   
>`acamp_candidates` int COMMENT '', 
>   
>`tag_format` string COMMENT '',
>   
>`optimizer_name` string COMMENT '',
>   
>`optimizer_version` string COMMENT '', 
>   
>

Re: Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
 '',
   `event_id` string COMMENT '',
   `impression_id` string COMMENT '',
   `diagnostic_data` string COMMENT '',
   `user_profile_mapping_source` string COMMENT '',
   `latitude` float COMMENT '',
   `longitude` float COMMENT '',
   `area_code` int COMMENT '',
   `gmt_offset` string COMMENT '',
   `in_dst` string COMMENT '',
   `proxy_type` string COMMENT '',
   `mobile_carrier` string COMMENT '',
   `pop` string COMMENT '',
   `hostname` string COMMENT '',
   `profile_ttl` string COMMENT '',
   `timestamp_iso` string COMMENT '',
   `reference_id` string COMMENT '',
   `identity_organization` string COMMENT '',
   `identity_method` string COMMENT '',
   `mappable_id` string COMMENT '',
   `profile_expires` string COMMENT '',
   `video_player_iframed` int COMMENT '',
   `video_player_in_view` int COMMENT '',
   `video_player_width` int COMMENT '',
   `video_player_height` int COMMENT '',
   `host_domain` string COMMENT '',
   `browser_type` string COMMENT '',
   `browser_device_cat` string COMMENT '',
   `browser_family` string COMMENT '',
   `browser_name` string COMMENT '',
   `browser_version` string COMMENT '',
   `browser_major_version` string COMMENT '',
   `browser_minor_version` string COMMENT '',
   `os_family` string COMMENT '',
   `os_name` string COMMENT '',
   `os_version` string COMMENT '',
   `os_major_version` string COMMENT '',
   `os_minor_version` string COMMENT '')
 PARTITIONED BY (`dt` timestamp)
 STORED AS PARQUET;

Thanks,
Ben


> On Jun 3, 2016, at 8:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> hang on are you saving this as a new table?
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 3 June 2016 at 14:13, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Does anyone know how to save data in a DataFrame to a table partitioned using 
> an existing column reformatted into a derived column?
> 
> val partitionedDf = df.withColumn("dt", 
> concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", 12, 
> 2), lit(":00")))
> 
> sqlContext.setConf("hive.exec.dynamic.partition", "true")
> sqlContext.setConf("hive.exec.dynamic.partition.mode", 
> "nonstrict")
> partitionedDf.write
> .mode(SaveMode.Append)
> .partitionBy("dt")
> .saveAsTable("ds.amo_bi_events")
> 
> I am getting an ArrayIndexOutOfBounds error. There are 83 columns in the 
> destination table, but after adding the derived column, the error references 84. 
> I assumed that the column used for the partition would not be counted.
> 
> Can someone ple

Save to a Partitioned Table using a Derived Column

2016-06-03 Thread Benjamin Kim
Does anyone know how to save data in a DataFrame to a table partitioned using 
an existing column reformatted into a derived column?

val partitionedDf = df.withColumn("dt", 
concat(substring($"timestamp", 1, 10), lit(" "), substring($"timestamp", 12, 
2), lit(":00")))

sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", 
"nonstrict")
partitionedDf.write
.mode(SaveMode.Append)
.partitionBy("dt")
.saveAsTable("ds.amo_bi_events")

I am getting an ArrayIndexOutOfBounds error. There are 83 columns in the destination 
table, but after adding the derived column, the error references 84. I assumed 
that the column used for the partition would not be counted.

Can someone please help.

Thanks,
Ben
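
In case it helps anyone hitting the same error: since the target table already exists, one pattern that may avoid the column-count mismatch is to select the columns in the table’s own order (Hive normally lists the partition column last) and write with insertInto, which matches columns by position, instead of saveAsTable. A rough sketch under those assumptions, not a verified fix:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

val target    = "ds.amo_bi_events"
val tableCols = sqlContext.table(target).columns             // partition column "dt" expected last
val aligned   = partitionedDf.select(tableCols.map(col): _*)

sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

aligned.write
  .mode(SaveMode.Append)
  .insertInto(target)                                        // positional insert; partitionBy is not needed here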



Re: Spark on Kudu

2016-05-28 Thread Benjamin Kim
Hi Chris,

Will all this effort be rolled into 0.9.0 and be ready for use?

Thanks,
Ben

> On May 18, 2016, at 9:01 AM, Chris George <christopher.geo...@rms.com> wrote:
> 
> There is some code in review that needs some more refinement.
> It will allow upsert/insert from a dataframe using the datasource api. It 
> will also allow the creation and deletion of tables from a dataframe
> http://gerrit.cloudera.org:8080/#/c/2992/ 
> <http://gerrit.cloudera.org:8080/#/c/2992/>
> 
> Example usages will look something like:
> http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc 
> <http://gerrit.cloudera.org:8080/#/c/2992/5/docs/developing.adoc>
> 
> -Chris George
> 
> 
> On 5/18/16, 9:45 AM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
> Can someone tell me what the state is of this Spark work?
> 
> Also, does anyone have any sample code on how to update/insert data in Kudu 
> using DataFrames?
> 
> Thanks,
> Ben
> 
> 
>> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com 
>> <mailto:christopher.geo...@rms.com>> wrote:
>> 
>> SparkSQL cannot support these type of statements but we may be able to 
>> implement similar functionality through the api.
>> -Chris
>> 
>> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
>> were to be implemented.
>> 
>> MERGE INTO table_name USING table_reference ON (condition)
>>  WHEN MATCHED THEN
>>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>>  WHEN NOT MATCHED THEN
>>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
>> 
>> Cheers,
>> Ben
>> 
>>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com 
>>> <mailto:christopher.geo...@rms.com>> wrote:
>>> 
>>> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit 
>>> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/ 
>>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>>> It does pushdown predicates which the existing input formatter based rdd 
>>> does not.
>>> 
>>> Within the next two weeks I’m planning to implement a datasource for spark 
>>> that will have pushdown predicates and insertion/update functionality (need 
>>> to look more at cassandra and the hbase datasource for best way to do this) 
>>> I agree that server side upsert would be helpful.
>>> Having a datasource would give us useful data frames and also make spark 
>>> sql usable for kudu.
>>> 
>>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>>> have had trouble getting impala to run fast with high concurrency when 
>>> compared to spark 2. We interact with datasources which do not integrate 
>>> with impala. 3. We have custom sql query planners for extended sql 
>>> functionality.
>>> 
>>> -Chris George
>>> 
>>> 
>>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org 
>>> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> You guys make a convincing point, although on the upsert side we'll need 
>>> more support from the servers. Right now all you can do is an INSERT then, 
>>> if you get a dup key, do an UPDATE. I guess we could at least add an API on 
>>> the client side that would manage it, but it wouldn't be atomic.
>>> 
>>> J-D
>>> 
>>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com 
>>> <mailto:m...@clearstorydata.com>> wrote:
>>> It's pretty simple, actually.  I need to support versioned datasets in a 
>>> Spark SQL environment.  Instead of a hack on top of a Parquet data store, 
>>> I'm hoping (among other reasons) to be able to use Kudu's write and 
>>> timestamp-based read operations to support not only appending data, but 
>>> also updating existing data, and even some schema migration.  The most 
>>> typical use case is a dataset that is updated periodically (e.g., weekly or 
>>> monthly) in which the preliminary data in the previous window (week or 
>>> month) is updated with values that are expected to remain unchanged from 
>>> then on, and a new set of preliminary values for the current window need to 
>>> be added/appended.
>>> 
>>> Using Kudu's Java API an

Re: Performance Question

2016-05-28 Thread Benjamin Kim
Todd,

It sounds like Kudu can possibly top or match those numbers put out by 
Aerospike. Do you have any performance statistics published, or any instructions 
on how to measure them myself as a good way to test? In addition, since this will 
be a test using Spark, should I wait for Kudu version 0.9.0, where support will 
be built in?

Thanks,
Ben


> On May 27, 2016, at 9:19 PM, Todd Lipcon <t...@cloudera.com> wrote:
> 
> On Fri, May 27, 2016 at 8:20 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Mike,
> 
> First of all, thanks for the link. It looks like an interesting read. I 
> checked that Aerospike is currently at version 3.8.2.3, and in the article, 
> they are evaluating version 3.5.4. The main thing that impressed me was their 
> claim that they can beat Cassandra and HBase by 8x for writing and 25x for 
> reading. Their big claim to fame is that Aerospike can write 1M records per 
> second with only 50 nodes. I wanted to see if this is real.
> 
> 1M records per second on 50 nodes is pretty doable by Kudu as well, depending 
> on the size of your records and the insertion order. I've been playing with a 
> ~70 node cluster recently and seen 1M+ writes/second sustained, and bursting 
> above 4M. These are 1KB rows with 11 columns, and with pretty old HDD-only 
> nodes. I think newer flash-based nodes could do better.
>  
> 
> To answer your questions, we have a DMP with user profiles with many 
> attributes. We create segmentation information off of these attributes to 
> classify them. Then, we can target advertising appropriately for our sales 
> department. Much of the data processing is for applying models on all or if 
> not most of every profile’s attributes to find similarities (nearest 
> neighbor/clustering) over a large number of rows when batch processing or a 
> small subset of rows for quick online scoring. So, our use case is a typical 
> advanced analytics scenario. We have tried HBase, but it doesn’t work well 
> for these types of analytics.
> 
> I read, that Aerospike in the release notes, they did do many improvements 
> for batch and scan operations.
> 
> I wonder what your thoughts are for using Kudu for this.
> 
> Sounds like a good Kudu use case to me. I've heard great things about 
> Aerospike for the low latency random access portion, but I've also heard that 
> it's _very_ expensive, and not particularly suited to the columnar scan 
> workload. Lastly, I think the Apache license of Kudu is much more appealing 
> than the AGPL3 used by Aerospike. But, that's not really a direct answer to 
> the performance question :)
>  
> 
> Thanks,
> Ben
> 
> 
>> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com 
>> <mailto:mpe...@cloudera.com>> wrote:
>> 
>> Have you considered whether you have a scan heavy or a random access heavy 
>> workload? Have you considered whether you always access / update a whole row 
>> vs only a partial row? Kudu is a column store so has some awesome 
>> performance characteristics when you are doing a lot of scanning of just a 
>> couple of columns.
>> 
>> I don't know the answer to your question but if your concern is performance 
>> then I would be interested in seeing comparisons from a perf perspective on 
>> certain workloads.
>> 
>> Finally, a year ago Aerospike did quite poorly in a Jepsen test: 
>> https://aphyr.com/posts/324-jepsen-aerospike 
>> <https://aphyr.com/posts/324-jepsen-aerospike>
>> 
>> I wonder if they have addressed any of those issues.
>> 
>> Mike
>> 
>> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I am just curious. How will Kudu compare with Aerospike 
>> (http://www.aerospike.com <http://www.aerospike.com/>)? I went to a Spark 
>> Roadshow and found out about this piece of software. It appears to fit our 
>> use case perfectly since we are an ad-tech company trying to leverage our 
>> user profiles data. Plus, it already has a Spark connector and has a 
>> SQL-like client. The tables can be accessed using Spark SQL DataFrames and, 
>> also, made into SQL tables for direct use with Spark SQL ODBC/JDBC 
>> Thriftserver. I see from the work done here 
>> http://gerrit.cloudera.org:8080/#/c/2992/ 
>> <http://gerrit.cloudera.org:8080/#/c/2992/> that the Spark integration is 
>> well underway and, from the looks of it lately, almost complete. I would 
>> prefer to use Kudu since we are already a Cloudera shop, and Kudu is easy to 
>> deploy and configure using Cloudera Manager. I also hope that some of 
>> Aerospike’s speed optimization techniques can make it into Kudu in the 
>> future, if they have not been already thought of or included.
>> 
>> Just some thoughts…
>> 
>> Cheers,
>> Ben
>> 
>> 
>> -- 
>> --
>> Mike Percy
>> Software Engineer, Cloudera
>> 
>> 
> 
> 
> 
> 
> -- 
> Todd Lipcon
> Software Engineer, Cloudera



Re: Performance Question

2016-05-27 Thread Benjamin Kim
Hi Mike,

First of all, thanks for the link. It looks like an interesting read. I checked 
that Aerospike is currently at version 3.8.2.3, and in the article, they are 
evaluating version 3.5.4. The main thing that impressed me was their claim that 
they can beat Cassandra and HBase by 8x for writing and 25x for reading. Their 
big claim to fame is that Aerospike can write 1M records per second with only 
50 nodes. I wanted to see if this is real.

To answer your questions, we have a DMP with user profiles with many 
attributes. We create segmentation information off of these attributes to 
classify them. Then, we can target advertising appropriately for our sales 
department. Much of the data processing is for applying models on all, or at 
least most, of every profile’s attributes to find similarities (nearest 
neighbor/clustering) over a large number of rows when batch processing, or on a 
small subset of rows for quick online scoring. So, our use case is a typical 
advanced analytics scenario. We have tried HBase, but it doesn’t work well for 
these types of analytics.

I read in Aerospike’s release notes that they did make many improvements for 
batch and scan operations.

I wonder what your thoughts are for using Kudu for this.

Thanks,
Ben


> On May 27, 2016, at 6:21 PM, Mike Percy <mpe...@cloudera.com> wrote:
> 
> Have you considered whether you have a scan heavy or a random access heavy 
> workload? Have you considered whether you always access / update a whole row 
> vs only a partial row? Kudu is a column store so has some awesome performance 
> characteristics when you are doing a lot of scanning of just a couple of 
> columns.
> 
> I don't know the answer to your question but if your concern is performance 
> then I would be interested in seeing comparisons from a perf perspective on 
> certain workloads.
> 
> Finally, a year ago Aerospike did quite poorly in a Jepsen test: 
> https://aphyr.com/posts/324-jepsen-aerospike 
> <https://aphyr.com/posts/324-jepsen-aerospike>
> 
> I wonder if they have addressed any of those issues.
> 
> Mike
> 
> On Friday, May 27, 2016, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I am just curious. How will Kudu compare with Aerospike 
> (http://www.aerospike.com <http://www.aerospike.com/>)? I went to a Spark 
> Roadshow and found out about this piece of software. It appears to fit our 
> use case perfectly since we are an ad-tech company trying to leverage our 
> user profiles data. Plus, it already has a Spark connector and has a SQL-like 
> client. The tables can be accessed using Spark SQL DataFrames and, also, made 
> into SQL tables for direct use with Spark SQL ODBC/JDBC Thriftserver. I see 
> from the work done here http://gerrit.cloudera.org:8080/#/c/2992/ 
> <http://gerrit.cloudera.org:8080/#/c/2992/> that the Spark integration is 
> well underway and, from the looks of it lately, almost complete. I would 
> prefer to use Kudu since we are already a Cloudera shop, and Kudu is easy to 
> deploy and configure using Cloudera Manager. I also hope that some of 
> Aerospike’s speed optimization techniques can make it into Kudu in the 
> future, if they have not been already thought of or included.
> 
> Just some thoughts…
> 
> Cheers,
> Ben
> 
> 
> -- 
> --
> Mike Percy
> Software Engineer, Cloudera
> 
> 



Performance Question

2016-05-27 Thread Benjamin Kim
I am just curious. How will Kudu compare with Aerospike 
(http://www.aerospike.com)? I went to a Spark Roadshow and found out about this 
piece of software. It appears to fit our use case perfectly since we are an 
ad-tech company trying to leverage our user profiles data. Plus, it already has 
a Spark connector and has a SQL-like client. The tables can be accessed using 
Spark SQL DataFrames and, also, made into SQL tables for direct use with Spark 
SQL ODBC/JDBC Thriftserver. I see from the work done here 
http://gerrit.cloudera.org:8080/#/c/2992/ that the Spark integration is well 
underway and, from the looks of it lately, almost complete. I would prefer to 
use Kudu since we are already a Cloudera shop, and Kudu is easy to deploy and 
configure using Cloudera Manager. I also hope that some of Aerospike’s speed 
optimization techniques can make it into Kudu in the future, if they have not 
been already thought of or included.

Just some thoughts…

Cheers,
Ben

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
I got my answer.

The way to access S3 has changed.

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", accessKey)
hadoopConf.set("fs.s3a.secret.key", secretKey)

val lines = ssc.textFileStream("s3a://amg-events-out/")

This worked.

Cheers,
Ben
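
One note in case others hit the same thing: the s3a scheme depends on the hadoop-aws module and a matching aws-java-sdk jar being on the classpath, which the CDH parcel should provide. If the scheme is still not picked up, the implementation can be pointed at explicitly; this single extra setting is an assumption on top of the code above, not something I needed here:

hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")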


> On May 21, 2016, at 4:18 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> Maybe more than one version of jets3t-xx.jar was on the classpath.
> 
> FYI
> 
> On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of 
> Spark 1.6.0. It seems not to work. I keep getting this error.
> 
> Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand 
> stack
> Exception Details:
>   Location:
> 
> org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
>  @155: invokevirtual
>   Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not 
> assignable to 'org/jets3t/service/model/StorageObject'
>   Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', 
> integer }
>   Bytecode:
> 0x000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 0x010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 0x020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 0x030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 0x040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 0x050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 0x060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 0x070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 0x080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 0x090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0x0a0: 000a 4e2a 2d2b b700 c7b1
>   Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
>   Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
> 
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:297)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:198)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:149)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at scala.Option.orElse(Option.scala:257)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
>

Re: Spark Streaming S3 Error

2016-05-21 Thread Benjamin Kim
Ted,

I only see 1 jets3t-0.9.0 jar in the classpath after running this to list the 
jars.

val cl = ClassLoader.getSystemClassLoader
cl.asInstanceOf[java.net.URLClassLoader].getURLs.foreach(println)

/opt/cloudera/parcels/CDH-5.7.0-1.cdh5.7.0.p0.45/jars/jets3t-0.9.0.jar

I don’t know what else could be wrong.

Thanks,
Ben

> On May 21, 2016, at 4:18 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> Maybe more than one version of jets3t-xx.jar was on the classpath.
> 
> FYI
> 
> On Fri, May 20, 2016 at 8:31 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of 
> Spark 1.6.0. It seems not to work. I keep getting this error.
> 
> Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand 
> stack
> Exception Details:
>   Location:
> 
> org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
>  @155: invokevirtual
>   Reason:
> Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not 
> assignable to 'org/jets3t/service/model/StorageObject'
>   Current Frame:
> bci: @155
> flags: { }
> locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
> stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 
> 'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', 
> integer }
>   Bytecode:
> 0x000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
> 0x010: 5659 b701 5713 0192 b601 5b2b b601 5b13
> 0x020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
> 0x030: b400 7db6 00e7 b601 5bb6 015e b901 9802
> 0x040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
> 0x050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
> 0x060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
> 0x070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
> 0x080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
> 0x090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
> 0x0a0: 000a 4e2a 2d2b b700 c7b1
>   Exception Handler Table:
> bci [0, 116] => handler: 162
> bci [117, 159] => handler: 162
>   Stackmap Table:
> same_frame_extended(@65)
> same_frame(@117)
> same_locals_1_stack_item_frame(@162,Object[#139])
> same_frame(@169)
> 
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:297)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:198)
> at 
> org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:149)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
> at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
> at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
> at scala.Option.orElse(Option.scala:257)
> at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
> at 
> org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
> at 
> org.apache.spark.

Spark Streaming S3 Error

2016-05-20 Thread Benjamin Kim
I am trying to stream files from an S3 bucket using CDH 5.7.0’s version of 
Spark 1.6.0. It seems not to work. I keep getting this error.

Exception in thread "JobGenerator" java.lang.VerifyError: Bad type on operand 
stack
Exception Details:
  Location:

org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore.copy(Ljava/lang/String;Ljava/lang/String;)V
 @155: invokevirtual
  Reason:
Type 'org/jets3t/service/model/S3Object' (current frame, stack[4]) is not 
assignable to 'org/jets3t/service/model/StorageObject'
  Current Frame:
bci: @155
flags: { }
locals: { 'org/apache/hadoop/fs/s3native/Jets3tNativeFileSystemStore', 
'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object' }
stack: { 'org/jets3t/service/S3Service', 'java/lang/String', 
'java/lang/String', 'java/lang/String', 'org/jets3t/service/model/S3Object', 
integer }
  Bytecode:
0x000: b200 fcb9 0190 0100 9900 39b2 00fc bb01
0x010: 5659 b701 5713 0192 b601 5b2b b601 5b13
0x020: 0194 b601 5b2c b601 5b13 0196 b601 5b2a
0x030: b400 7db6 00e7 b601 5bb6 015e b901 9802
0x040: 002a b400 5799 0030 2ab4 0047 2ab4 007d
0x050: 2b01 0101 01b6 019b 4e2a b400 6b09 949e
0x060: 0016 2db6 019c 2ab4 006b 949e 000a 2a2d
0x070: 2cb6 01a0 b1bb 00a0 592c b700 a14e 2d2a
0x080: b400 73b6 00b0 2ab4 0047 2ab4 007d b600
0x090: e72b 2ab4 007d b600 e72d 03b6 01a4 57a7
0x0a0: 000a 4e2a 2d2b b700 c7b1   
  Exception Handler Table:
bci [0, 116] => handler: 162
bci [117, 159] => handler: 162
  Stackmap Table:
same_frame_extended(@65)
same_frame(@117)
same_locals_1_stack_item_frame(@162,Object[#139])
same_frame(@169)

at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.createDefaultStore(NativeS3FileSystem.java:338)
at 
org.apache.hadoop.fs.s3native.NativeS3FileSystem.initialize(NativeS3FileSystem.java:328)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2696)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2733)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2715)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:382)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$fs(FileInputDStream.scala:297)
at 
org.apache.spark.streaming.dstream.FileInputDStream.findNewFiles(FileInputDStream.scala:198)
at 
org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:149)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.MappedDStream.compute(MappedDStream.scala:35)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:352)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:351)
at 
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:346)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:344)
at scala.Option.orElse(Option.scala:257)
at 
org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:341)
at 
org.apache.spark.streaming.dstream.ForEachDStream.generateJob(ForEachDStream.scala:47)
at 

Re: Spark on Kudu

2016-05-18 Thread Benjamin Kim
Can someone tell me what the state is of this Spark work?

Also, does anyone have any sample code on how to update/insert data in Kudu 
using DataFrames?

Thanks,
Ben
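
For anyone looking for a starting point while the datasource in Chris’s gerrit change is still in review, the sketch below follows the shape of later kudu-spark releases; the package name, the KuduContext constructor, the option keys, and the table/master names are all assumptions and may not match the 0.9.0-era code:

import org.apache.kudu.spark.kudu._

val kuduMaster  = "kudu-master-1:7051"          // placeholder master address
val kuduContext = new KuduContext(kuduMaster)

// upsert the rows of a DataFrame into an existing Kudu table
kuduContext.upsertRows(df, "my_kudu_table")

// or go through the DataFrame writer API
df.write
  .options(Map("kudu.master" -> kuduMaster, "kudu.table" -> "my_kudu_table"))
  .mode("append")
  .format("org.apache.kudu.spark.kudu")
  .save()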


> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
> 
> SparkSQL cannot support these type of statements but we may be able to 
> implement similar functionality through the api.
> -Chris
> 
> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
> were to be implemented.
> 
> MERGE INTO table_name USING table_reference ON (condition)
>  WHEN MATCHED THEN
>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>  WHEN NOT MATCHED THEN
>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
> 
> Cheers,
> Ben
> 
>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com 
>> <mailto:christopher.geo...@rms.com>> wrote:
>> 
>> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit 
>> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/ 
>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>> It does pushdown predicates which the existing input formatter based rdd 
>> does not.
>> 
>> Within the next two weeks I’m planning to implement a datasource for spark 
>> that will have pushdown predicates and insertion/update functionality (need 
>> to look more at cassandra and the hbase datasource for best way to do this) 
>> I agree that server side upsert would be helpful.
>> Having a datasource would give us useful data frames and also make spark sql 
>> usable for kudu.
>> 
>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>> have had trouble getting impala to run fast with high concurrency when 
>> compared to spark 2. We interact with datasources which do not integrate 
>> with impala. 3. We have custom sql query planners for extended sql 
>> functionality.
>> 
>> -Chris George
>> 
>> 
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> 
>> You guys make a convincing point, although on the upsert side we'll need 
>> more support from the servers. Right now all you can do is an INSERT then, 
>> if you get a dup key, do an UPDATE. I guess we could at least add an API on 
>> the client side that would manage it, but it wouldn't be atomic.
>> 
>> J-D
>> 
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com 
>> <mailto:m...@clearstorydata.com>> wrote:
>> It's pretty simple, actually.  I need to support versioned datasets in a 
>> Spark SQL environment.  Instead of a hack on top of a Parquet data store, 
>> I'm hoping (among other reasons) to be able to use Kudu's write and 
>> timestamp-based read operations to support not only appending data, but also 
>> updating existing data, and even some schema migration.  The most typical 
>> use case is a dataset that is updated periodically (e.g., weekly or monthly) 
>> in which the preliminary data in the previous window (week or month) is 
>> updated with values that are expected to remain unchanged from then on, and 
>> a new set of preliminary values for the current window need to be 
>> added/appended.
>> 
>> Using Kudu's Java API and developing additional functionality on top of what 
>> Kudu has to offer isn't too much to ask, but the ease of integration with 
>> Spark SQL will gate how quickly we would move to using Kudu and how 
>> seriously we'd look at alternatives before making that decision. 
>> 
>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> Mark,
>> 
>> Thanks for taking some time to reply in this thread, glad it caught the 
>> attention of other folks!
>> 
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com 
>> <mailto:m...@clearstorydata.com>> wrote:
>> Do they care being able to insert into Kudu with SparkSQL
>> 
>> I care about insert into Kudu with Spark SQL.  I'm currently delaying a 
>> refactoring of some Spark SQL-oriented insert functionality while trying to 
>> evaluate what to expect from Kudu.  Whether Kudu does a good job supporting 
>> inserts with Spark SQL will be a key consideration as to whether we adopt 
>> Kudu.
>> 
>> I'd like to know more about why SparkSQL inserts in necessa

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-16 Thread Benjamin Kim
I have a curiosity question. These forever/unlimited DataFrames/Datasets will 
persist and be query capable, but I am still foggy about how this data will be 
stored. As far as I know, memory is finite. Will the data be spilled to disk 
and be retrievable if the query spans data not in memory? Is Tachyon (Alluxio), 
HDFS (Parquet), NoSQL (HBase, Cassandra), RDBMS (PostgreSQL, MySQL), an object 
store (S3, Swift), or anything else I can’t think of going to be the underlying 
near-real-time storage system?

Thanks,
Ben

> On May 15, 2016, at 3:36 PM, Yuval Itzchakov  wrote:
> 
> Hi Ofir,
> Thanks for the elaborated answer. I have read both documents, where they do a 
> light touch on infinite Dataframes/Datasets. However, they do not go in depth 
> as regards to how existing transformations on DStreams, for example, will be 
> transformed into the Dataset APIs. I've been browsing the 2.0 branch and have 
> yet been able to understand how they correlate.
> 
> Also, placing SparkSession in the sql package seems like a peculiar choice, 
> since this is going to be the global abstraction over 
> SparkContext/StreamingContext from now on.
> 
> On Sun, May 15, 2016, 23:42 Ofir Manor  > wrote:
> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two main 
> ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (merging 
> of Dataset and Dataframe - which is why it inherits all the SparkSQL 
> goodness), while RDD seems as a low-level API only for special cases. The new 
> Dataset should also support both batch and streaming - replacing (eventually) 
> DStream as well. See the design docs in SPARK-13485 (unified API) and 
> SPARK-8360 (StructuredStreaming) for a good intro. 
> However, as you noted, not all will be fully delivered in 2.0. For example, 
> it seems that streaming from / to Kafka using StructuredStreaming didn't make 
> it (so far?) to 2.0 (which is a showstopper for me). 
> Anyway, as far as I understand, you should be able to apply stateful 
> operators (non-RDD) on Datasets (for example, the new event-time window 
> processing SPARK-8360). The gap I see is mostly limited streaming sources / 
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples 
> will align with the current offering...
> 
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov  > wrote:
> I've been reading/watching videos about the upcoming Spark 2.0 release which
> brings us Structured Streaming. One thing I've yet to understand is how this
> relates to the current state of working with Streaming in Spark with the
> DStream abstraction.
> 
> All examples I can find, in the Spark repository/different videos is someone
> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
> the source, SparkSession seems to be defined inside org.apache.spark.sql, so
> this gives me a hunch that this is somehow all related to SQL and the likes,
> and not really to DStreams.
> 
> What I'm failing to understand is: Will this feature impact how we do
> Streaming today? Will I be able to consume a Kafka source in a streaming
> fashion (like we do today when we open a stream using KafkaUtils)? Will we
> be able to do state-full operations on a Dataset[T] like we do today using
> MapWithStateRDD? Or will there be a subset of operations that the catalyst
> optimizer can understand such as aggregate and such?
> 
> I'd be happy anyone could shed some light on this.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 



CDH 5.7.0

2016-05-16 Thread Benjamin Kim
Has anyone got Phoenix to work with CDH 5.7.0? I tried manually patching and 
building the project using https://issues.apache.org/jira/browse/PHOENIX-2834 
 as a guide. I followed the 
instructions to install the components detailed in the top section of 
http://phoenix.apache.org/installation.html 
. I still can’t get it to work.

Thanks for any help in advance.

Cheers,
Ben



Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Ofir,

Thanks for the clarification. I was confused for the moment. The links will be 
very helpful.


> On May 15, 2016, at 2:32 PM, Ofir Manor <ofir.ma...@equalum.io> wrote:
> 
> Ben,
> I'm just a Spark user - but at least in March Spark Summit, that was the main 
> term used.
> Taking a step back from the details, maybe this new post from Reynold is a 
> better intro to Spark 2.0 highlights 
> https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html
>  
> <https://databricks.com/blog/2016/05/11/spark-2-0-technical-preview-easier-faster-and-smarter.html>
> 
> If you want to drill down, go to SPARK-8360 "Structured Streaming (aka 
> Streaming DataFrames)". The design doc (written by Reynold in March) is very 
> readable:
>  https://issues.apache.org/jira/browse/SPARK-8360 
> <https://issues.apache.org/jira/browse/SPARK-8360>
> 
> Regarding directly querying (SQL) the state managed by a streaming process - 
> I don't know if that will land in 2.0 or only later.
> 
> Hope that helps,
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286 <tel:%2B972-54-7801286> | Email: 
> ofir.ma...@equalum.io <mailto:ofir.ma...@equalum.io>
> On Sun, May 15, 2016 at 11:58 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Ofir,
> 
> I just recently saw the webinar with Reynold Xin. He mentioned the Spark 
> Session unification efforts, but I don’t remember the DataSet for Structured 
> Streaming aka Continuous Applications as he put it. He did mention streaming 
> or unlimited DataFrames for Structured Streaming so one can directly query 
> the data from it. Has something changed since then?
> 
> Thanks,
> Ben
> 
> 
>> On May 15, 2016, at 1:42 PM, Ofir Manor <ofir.ma...@equalum.io 
>> <mailto:ofir.ma...@equalum.io>> wrote:
>> 
>> Hi Yuval,
>> let me share my understanding based on similar questions I had.
>> First, Spark 2.x aims to replace a whole bunch of its APIs with just two 
>> main ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset 
>> (merging of Dataset and Dataframe - which is why it inherits all the 
>> SparkSQL goodness), while RDD seems as a low-level API only for special 
>> cases. The new Dataset should also support both batch and streaming - 
>> replacing (eventually) DStream as well. See the design docs in SPARK-13485 
>> (unified API) and SPARK-8360 (StructuredStreaming) for a good intro. 
>> However, as you noted, not all will be fully delivered in 2.0. For example, 
>> it seems that streaming from / to Kafka using StructuredStreaming didn't 
>> make it (so far?) to 2.0 (which is a showstopper for me). 
>> Anyway, as far as I understand, you should be able to apply stateful 
>> operators (non-RDD) on Datasets (for example, the new event-time window 
>> processing SPARK-8360). The gap I see is mostly limited streaming sources / 
>> sinks migrated to the new (richer) API and semantics.
>> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples 
>> will align with the current offering...
>> 
>> 
>> Ofir Manor
>> 
>> Co-Founder & CTO | Equalum
>> 
>> 
>> Mobile: +972-54-7801286 <tel:%2B972-54-7801286> | Email: 
>> ofir.ma...@equalum.io <mailto:ofir.ma...@equalum.io>
>> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov <yuva...@gmail.com 
>> <mailto:yuva...@gmail.com>> wrote:
>> I've been reading/watching videos about the upcoming Spark 2.0 release which
>> brings us Structured Streaming. One thing I've yet to understand is how this
>> relates to the current state of working with Streaming in Spark with the
>> DStream abstraction.
>> 
>> All examples I can find, in the Spark repository/different videos is someone
>> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
>> the source, SparkSession seems to be defined inside org.apache.spark.sql, so
>> this gives me a hunch that this is somehow all related to SQL and the likes,
>> and not really to DStreams.
>> 
>> What I'm failing to understand is: Will this feature impact how we do
>> Streaming today? Will I be able to consume a Kafka source in a streaming
>> fashion (like we do today when we open a stream using KafkaUtils)? Will we
>> be able to do state-full operations on a Dataset[T] like we do today using
>> MapWithStateRDD? Or will there be a subset of operations that the catalyst
>> optimizer can understand such as aggregate and such?
>> 
>> I'd be happy anyone could s

Re: Structured Streaming in Spark 2.0 and DStreams

2016-05-15 Thread Benjamin Kim
Hi Ofir,

I just recently saw the webinar with Reynold Xin. He mentioned the Spark Session 
unification efforts, but I don’t remember him covering the Dataset for Structured 
Streaming, aka Continuous Applications as he put it. He did mention streaming or 
unlimited DataFrames for Structured Streaming, so one can directly query the 
data from them. Has something changed since then?

Thanks,
Ben


> On May 15, 2016, at 1:42 PM, Ofir Manor  wrote:
> 
> Hi Yuval,
> let me share my understanding based on similar questions I had.
> First, Spark 2.x aims to replace a whole bunch of its APIs with just two main 
> ones - SparkSession (replacing Hive/SQL/Spark Context) and Dataset (merging 
> of Dataset and Dataframe - which is why it inherits all the SparkSQL 
> goodness), while RDD seems as a low-level API only for special cases. The new 
> Dataset should also support both batch and streaming - replacing (eventually) 
> DStream as well. See the design docs in SPARK-13485 (unified API) and 
> SPARK-8360 (StructuredStreaming) for a good intro. 
> However, as you noted, not all will be fully delivered in 2.0. For example, 
> it seems that streaming from / to Kafka using StructuredStreaming didn't make 
> it (so far?) to 2.0 (which is a showstopper for me). 
> Anyway, as far as I understand, you should be able to apply stateful 
> operators (non-RDD) on Datasets (for example, the new event-time window 
> processing SPARK-8360). The gap I see is mostly limited streaming sources / 
> sinks migrated to the new (richer) API and semantics.
> Anyway, I'm pretty sure once 2.0 gets to RC, the documentation and examples 
> will align with the current offering...
> 
> 
> Ofir Manor
> 
> Co-Founder & CTO | Equalum
> 
> 
> Mobile: +972-54-7801286  | Email: 
> ofir.ma...@equalum.io 
> On Sun, May 15, 2016 at 1:52 PM, Yuval.Itzchakov  > wrote:
> I've been reading/watching videos about the upcoming Spark 2.0 release which
> brings us Structured Streaming. One thing I've yet to understand is how this
> relates to the current state of working with Streaming in Spark with the
> DStream abstraction.
> 
> All examples I can find, in the Spark repository/different videos is someone
> streaming local JSON files or reading from HDFS/S3/SQL. Also, when browsing
> the source, SparkSession seems to be defined inside org.apache.spark.sql, so
> this gives me a hunch that this is somehow all related to SQL and the likes,
> and not really to DStreams.
> 
> What I'm failing to understand is: Will this feature impact how we do
> Streaming today? Will I be able to consume a Kafka source in a streaming
> fashion (like we do today when we open a stream using KafkaUtils)? Will we
> be able to do state-full operations on a Dataset[T] like we do today using
> MapWithStateRDD? Or will there be a subset of operations that the catalyst
> optimizer can understand such as aggregate and such?
> 
> I'd be happy anyone could shed some light on this.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-in-Spark-2-0-and-DStreams-tp26959.html
>  
> 
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 



Sparse Data

2016-05-12 Thread Benjamin Kim
Can Kudu handle the use case where sparse data is involved? In many of our 
processes, we deal with data that can have any number of columns and many 
previously unknown column names, depending on what attributes are brought in at 
the time. Currently, we use HBase to handle this. Since Kudu is often compared 
to HBase, can it do the same? Or do we have to use a map data type column for this?

Thanks,
Ben



Re: Help with getting Zeppelin running on CH 5.7

2016-05-11 Thread Benjamin Kim
It’s currently being addressed here.

https://github.com/apache/incubator-zeppelin/pull/868 



> On May 11, 2016, at 3:08 PM, Shankar Roy  wrote:
> 
> Hi,
> I am trying to get Zeppeling running on a pseudo node cluster of CDH 5.7.0
> 
> I have posted this issue on Cloudera forum also 
> http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/java-lang-NoSuchMethodException-exception-with-CDH-5-7-0-Apache/m-p/40737/highlight/false#M1680
>  
> 
> 
> 
> The command used to compile mvn clean package -Pspark-1.6 -Ppyspark 
> -Dhadoop.version=2.6.0-cdh5.7.0 -Phadoop-2.6 -Pyarn -Pvendor-repo -DskipTests
> 
> When I am trying to run sample code of Zeppeling I am getting error.
> 
> java.lang.NoSuchMethodException: 
> org.apache.spark.repl.SparkILoop$SparkILoopInterpreter.classServerUri()
> at java.lang.Class.getMethod(Class.java:1665)
> at 
> org.apache.zeppelin.spark.SparkInterpreter.createSparkContext(SparkInterpreter.java:275)
> at 
> org.apache.zeppelin.spark.SparkInterpreter.getSparkContext(SparkInterpreter.java:150)
> at org.apache.zeppelin.spark.SparkInterpreter.open(SparkInterpreter.java:525)
> at 
> org.apache.zeppelin.interpreter.ClassloaderInterpreter.open(ClassloaderInterpreter.java:74)
> at 
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.open(LazyOpenInterpreter.java:68)
> at 
> org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:92)
> at 
> org.apache.zeppelin.interpreter.remote.RemoteInterpreterServer$InterpretJob.jobRun(RemoteInterpreterServer.java:345)
> at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
> at org.apache.zeppelin.scheduler.FIFOScheduler$1.run(FIFOScheduler.java:139)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



Re: Save DataFrame to HBase

2016-05-10 Thread Benjamin Kim
Ted,

Will the hbase-spark module allow for creating tables in Spark SQL that 
reference the hbase tables underneath? In this way, users can query using just 
SQL.

Thanks,
Ben
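
While waiting on hbase-spark’s SQL story, one path that already gives plain SQL access for querying is mapping the HBase table through Hive’s HBase storage handler and reading it with HiveContext. The table, column family, and qualifier names below are made up for illustration, and it assumes the hive-hbase-handler jar is available (CDH ships it):

// register a Hive external table backed by an existing HBase table, then query it with SQL
sqlContext.sql("""
  CREATE EXTERNAL TABLE user_profiles (rowkey STRING, name STRING, age STRING)
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:name,cf:age")
  TBLPROPERTIES ("hbase.table.name" = "user_profiles")
""")

sqlContext.sql("SELECT rowkey, name FROM user_profiles LIMIT 10").show()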

> On Apr 28, 2016, at 3:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> Hbase 2.0 release likely would come after Spark 2.0 release. 
> 
> There're other features being developed in hbase 2.0
> I am not sure when hbase 2.0 would be released. 
> 
> The refguide is incomplete. 
> Zhan has assigned the doc JIRA to himself. The documentation would be done 
> after fixing bugs in hbase-spark module. 
> 
> Cheers
> 
> On Apr 27, 2016, at 10:31 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
>> Hi Ted,
>> 
>> Do you know when the release will be? I also see some documentation for 
>> usage of the hbase-spark module at the hbase website. But, I don’t see an 
>> example on how to save data. There is only one for reading/querying data. 
>> Will this be added when the final version does get released?
>> 
>> Thanks,
>> Ben
>> 
>>> On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com 
>>> <mailto:yuzhih...@gmail.com>> wrote:
>>> 
>>> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can 
>>> do this.
>>> 
>>> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Has anyone found an easy way to save a DataFrame into HBase?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>> <mailto:user-h...@spark.apache.org>
>>> 
>>> 
>> 



Zeppelin 0.6 Build

2016-05-07 Thread Benjamin Kim
When trying to build the latest from Git, I get these errors.

[ERROR] 
/home/zeppelin/incubator-zeppelin/spark/src/main/java/org/apache/zeppelin/spark/ZeppelinR.java:[25,25]
 package parquet.org.slf4j does not exist
[ERROR] 
/home/zeppelin/incubator-zeppelin/spark/src/main/java/org/apache/zeppelin/spark/ZeppelinR.java:[26,25]
 package parquet.org.slf4j does not exist
[ERROR] 
/home/zeppelin/incubator-zeppelin/spark/src/main/java/org/apache/zeppelin/spark/ZeppelinR.java:[37,3]
 cannot find symbol

Please let me know what I can do to work around this.

Thanks,
Ben



Completed Tasks in YARN will not release resources

2016-04-30 Thread Benjamin Kim
Has anyone encountered this problem with YARN. It all started after an attempt 
to upgrade from CDH 5.4.8 to CDH 5.5.2.

I ran jobs overnight, and they never completed. But, it did take down the YARN 
ResourceManager and multiple NodeManagers after 5 or 6 hours. There was one job 
that out of 450 mappers, only 64 completed, 386 pending, and 0 running. The 
pending mappers are in a Scheduled state.

Each data node has 24 cores, 64 GB, 6 drives x 2 TB. NodeManager is allocated 
45GB; Mappers are allocated 4GB (3.2GB Heap); Reducers are allocated 8GB (6.4GB 
Heap); AM is allocated 8GB (6.4 GB Heap).

There was a note for each completed task attempt stating that it was in the 
finished state for too long. I believe that the task is not releasing the 
container when it’s done or is not communicating it back.

Does anyone have any ideas?

Thanks,
Ben





AWS SDK Client

2016-04-28 Thread Benjamin Kim
Has anyone used the AWS SDK client libraries in Java to instantiate a client 
for Spark jobs? Up to a few days ago, the SQS client was not having any 
problems, but all of a sudden, this error came up.

java.lang.NoSuchMethodError: 
org.apache.http.conn.scheme.Scheme.(Ljava/lang/String;ILorg/apache/http/conn/scheme/SchemeSocketFactory;)V
at 
org.apache.http.impl.conn.SchemeRegistryFactory.createDefault(SchemeRegistryFactory.java:50)
at 
org.apache.http.impl.conn.PoolingClientConnectionManager.(PoolingClientConnectionManager.java:95)
at 
com.amazonaws.http.ConnectionManagerFactory.createPoolingClientConnManager(ConnectionManagerFactory.java:26)
at 
com.amazonaws.http.HttpClientFactory.createHttpClient(HttpClientFactory.java:96)
at com.amazonaws.http.AmazonHttpClient.(AmazonHttpClient.java:158)
at 
com.amazonaws.AmazonWebServiceClient.(AmazonWebServiceClient.java:119)
at 
com.amazonaws.services.sqs.AmazonSQSClient.(AmazonSQSClient.java:242)
at 
com.amazonaws.services.sqs.AmazonSQSClient.(AmazonSQSClient.java:219)
at 
com.amazonaws.services.sqs.AmazonSQSClient.(AmazonSQSClient.java:123)

Has anyone seen this kind of problem before? Can someone help?

Thanks,
Ben
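
A NoSuchMethodError on org.apache.http.conn.scheme.Scheme usually points at an older httpclient jar shadowing the version the AWS SDK was built against (the Hadoop/CDH classpath carries its own httpclient). A quick way to see which jar actually wins at runtime, run from the same spark-shell or job, is to ask the classloader; this is purely a diagnostic sketch, not a fix:

// jar that provides the httpclient Scheme class the AWS SDK ends up calling
println(classOf[org.apache.http.conn.scheme.Scheme].getProtectionDomain.getCodeSource.getLocation)

// jar that provides the AWS SQS client itself, for comparison
println(classOf[com.amazonaws.services.sqs.AmazonSQSClient].getProtectionDomain.getCodeSource.getLocation)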

Re: Spark 2.0 Release Date

2016-04-28 Thread Benjamin Kim
Next Thursday is Databricks' webinar on Spark 2.0. If you are attending, I bet 
many are going to ask when the release will be. Last time they did this, Spark 
1.6 came out not too long afterward.

> On Apr 28, 2016, at 5:21 AM, Sean Owen  wrote:
> 
> I don't know if anyone has begun a firm discussion on dates, but there
> are >100 open issues and ~10 blockers, so still some work to do before
> code freeze, it looks like. My unofficial guess is mid June before
> it's all done.
> 
> On Thu, Apr 28, 2016 at 12:43 PM, Arun Patel  wrote:
>> A small request.
>> 
>> Would you mind providing an approximate date of Spark 2.0 release?  Is it
>> early May or Mid May or End of May?
>> 
>> Thanks,
>> Arun
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 





Spark 2.0+ Structured Streaming

2016-04-28 Thread Benjamin Kim
Can someone explain to me how the new Structured Streaming works in the 
upcoming Spark 2.0+? I’m a little hazy about how data will be stored and referenced 
if it can be queried and/or batch processed directly from streams, and about 
whether the data will be append-only or whether there will be some sort of upsert 
capability available. This almost sounds similar to what AWS Kinesis is trying to 
achieve, but Kinesis can only store the data for 24 hours. Am I close?

Thanks,
Ben



Re: Save DataFrame to HBase

2016-04-27 Thread Benjamin Kim
Hi Ted,

Do you know when the release will be? I also see some documentation for usage 
of the hbase-spark module at the hbase website. But, I don’t see an example on 
how to save data. There is only one for reading/querying data. Will this be 
added when the final version does get released?

Thanks,
Ben

> On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do 
> this.
> 
> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Has anyone found an easy way to save a DataFrame into HBase?
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Re: Save DataFrame to HBase

2016-04-27 Thread Benjamin Kim
Daniel,

If you can get the code snippet, that would be great! I’ve been trying to get 
it to work for me as well. The examples on the Phoenix website do not work for 
me. If you are willing to also, can you include your setup to make Phoenix work 
with Spark?

Thanks,
Ben

> On Apr 27, 2016, at 11:46 AM, Paras sachdeva <paras.sachdeva11...@gmail.com> 
> wrote:
> 
> Hi Daniel,
> 
> Would you possibly be able to share the snipped to code you have used ?
> 
> Thank you.
> 
> On Wed, Apr 27, 2016 at 3:13 PM, Daniel Haviv 
> <daniel.ha...@veracity-group.com <mailto:daniel.ha...@veracity-group.com>> 
> wrote:
> Hi Benjamin,
> Yes it should work.
> 
> Let me know if you need further assistance I might be able to get the code 
> I've used for that project.
> 
> Thank you.
> Daniel
> 
> On 24 Apr 2016, at 17:35, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
>> Hi Daniel,
>> 
>> How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed which 
>> comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
>> 
>> Thanks,
>> Ben
>> 
>>> On Apr 24, 2016, at 1:43 AM, Daniel Haviv <daniel.ha...@veracity-group.com 
>>> <mailto:daniel.ha...@veracity-group.com>> wrote:
>>> 
>>> Hi,
>>> I tried saving DF to HBase using a hive table with hbase storage handler 
>>> and hiveContext but it failed due to a bug.
>>> 
>>> I was able to persist the DF to hbase using Apache Pheonix which was pretty 
>>> simple.
>>> 
>>> Thank you.
>>> Daniel
>>> 
>>> On 21 Apr 2016, at 16:52, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>>> Has anyone found an easy way to save a DataFrame into HBase?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>>> <mailto:user-h...@spark.apache.org>
>>>> 
>> 
> 



Convert DataFrame to Array of Arrays

2016-04-24 Thread Benjamin Kim
I have data in a DataFrame loaded from a CSV file. I need to load this data 
into HBase using an RDD formatted in a certain way.

val rdd = sc.parallelize(
Array(key1,
(ColumnFamily, ColumnName1, Value1),
(ColumnFamily, ColumnName2, Value2),
(…),
key2,
(ColumnFamily, ColumnName1, Value1),
(ColumnFamily, ColumnName2, Value2),
(…),
…)
)

Can someone help me to iterate through each column in a row of data to build 
such an Array structure?
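
In case it helps anyone searching the archives later, here is a rough sketch of one 
way to do that iteration, assuming the first column is the row key and every other 
column becomes a (family, qualifier, value) cell; the column family name and the key 
position are assumptions:

val columnFamily = "cf" // assumed column family name

// Map each Row to (rowKey, Seq((family, qualifier, value), ...)),
// using the DataFrame's own schema to walk the columns.
val hbaseReady = df.rdd.map { row =>
  val fields = row.schema.fieldNames
  val rowKey = row.getAs[Any](fields.head).toString // assumes the key column is never null
  val cells = fields.tail.map { name =>
    val v = row.getAs[Any](name)
    (columnFamily, name, if (v == null) null else v.toString)
  }
  (rowKey, cells)
}

From there each element can be mapped into whatever Put/mutation structure the chosen 
HBase write path expects.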

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Save DataFrame to HBase

2016-04-24 Thread Benjamin Kim
Hi Daniel,

How did you get the Phoenix plugin to work? I have CDH 5.5.2 installed which 
comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?

Thanks,
Ben

> On Apr 24, 2016, at 1:43 AM, Daniel Haviv <daniel.ha...@veracity-group.com> 
> wrote:
> 
> Hi,
> I tried saving DF to HBase using a hive table with hbase storage handler and 
> hiveContext but it failed due to a bug.
> 
> I was able to persist the DF to hbase using Apache Pheonix which was pretty 
> simple.
> 
> Thank you.
> Daniel
> 
> On 21 Apr 2016, at 16:52, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
>> Has anyone found an easy way to save a DataFrame into HBase?
>> 
>> Thanks,
>> Ben
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 



Re: Save DataFrame to HBase

2016-04-21 Thread Benjamin Kim
Hi Ted,

Can this module be used with an older version of HBase, such as 1.0 or 1.1? 
Where can I get the module from?

Thanks,
Ben

> On Apr 21, 2016, at 6:56 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> 
> The hbase-spark module in Apache HBase (coming with hbase 2.0 release) can do 
> this.
> 
> On Thu, Apr 21, 2016 at 6:52 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Has anyone found an easy way to save a DataFrame into HBase?
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Save DataFrame to HBase

2016-04-21 Thread Benjamin Kim
Has anyone found an easy way to save a DataFrame into HBase?

Thanks,
Ben


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



HBase Spark Module

2016-04-20 Thread Benjamin Kim
I see that the new CDH 5.7 has been release with the HBase Spark module 
built-in. I was wondering if I could just download it and use the hbase-spark 
jar file for CDH 5.5. Has anyone tried this yet?

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: JSON Usage

2016-04-17 Thread Benjamin Kim
Hyukjin,

This is what I did so far. I didn’t use DataSet yet or maybe I don’t need to.

var df: DataFrame = null
for(message <- messages) {
val bodyRdd = sc.parallelize(message.getBody() :: Nil)
val fileDf = sqlContext.read.json(bodyRdd)
.select(
$"Records.s3.bucket.name".as("bucket"),
$"Records.s3.object.key".as("key")
)
if (df != null) {
  df = df.unionAll(fileDf)
} else {
  df = fileDf
}
}
df.show

Each result is returned as an array. I just need to pair up the bucket and key 
values to build the S3 URLs, then download the file at each URL. This is the part 
I need help with next.
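
Continuing from the df built above: because Records is an array, bucket and key come 
back as arrays of the same length, so a rough sketch of zipping them into URLs and 
loading each file with spark-csv would be (the s3n scheme is an assumption; use 
whichever filesystem connector is configured):

// Flatten the per-message arrays into (bucket, key) pairs and build the URLs.
val urls = df.rdd.flatMap { row =>
  val buckets = row.getAs[Seq[String]]("bucket")
  val keys    = row.getAs[Seq[String]]("key")
  buckets.zip(keys).map { case (b, k) => s"s3n://$b/$k" } // scheme is an assumption
}.collect()

// Load each file into its own DataFrame via spark-csv.
val fileDfs = urls.map { url =>
  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(url)
}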

Thanks,
Ben

> On Apr 17, 2016, at 7:38 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> Hi!
> 
> Personally, I don't think it necessarily needs to be DataSet for your goal.
> 
> Just select your data at "s3" from DataFrame loaded by sqlContext.read.json().
> 
> You can try to printSchema() to check the nested schema and then select the 
> data.
> 
> Also, I guess (from your code) you are trying to send a request, fetch the 
> response to driver-side, and then send each message to executor-side. I guess 
> there would be really heavy overhead in driver-side.
> Holden,
> 
> If I were to use DataSets, then I would essentially do this:
> 
> val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
> val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
> for (message <- messages.asScala) {
> val files = sqlContext.read.json(message.getBody())
> }
> 
> Can I simply do files.toDS() or do I have to create a schema using a case 
> class File and apply it as[File]? If I have to apply a schema, then how would 
> I create it based on the JSON structure below, especially the nested elements.
> 
> Thanks,
> Ben
> 
> 
>> On Apr 14, 2016, at 3:46 PM, Holden Karau <hol...@pigscanfly.ca 
>> <mailto:hol...@pigscanfly.ca>> wrote:
>> 
>> You could certainly use RDDs for that, you might also find using Dataset 
>> selecting the fields you need to construct the URL to fetch and then using 
>> the map function to be easier.
>> 
>> On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I was wonder what would be the best way to use JSON in Spark/Scala. I need 
>> to lookup values of fields in a collection of records to form a URL and 
>> download that file at that location. I was thinking an RDD would be perfect 
>> for this. I just want to hear from others who might have more experience in 
>> this. Below is the actual JSON structure that I am trying to use for the S3 
>> bucket and key values of each “record" within “Records".
>> 
>> {
>>"Records":[
>>   {
>>  "eventVersion":"2.0",
>>  "eventSource":"aws:s3",
>>  "awsRegion":"us-east-1",
>>  "eventTime":The time, in ISO-8601 format, for example, 
>> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>>  "eventName":"event-type",
>>  "userIdentity":{
>> 
>> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>>  },
>>  "requestParameters":{
>> "sourceIPAddress":"ip-address-where-request-came-from"
>>  },
>>  "responseElements":{
>> "x-amz-request-id":"Amazon S3 generated request ID",
>> "x-amz-id-2":"Amazon S3 host that processed the request"
>>  },
>>  "s3":{
>> "s3SchemaVersion":"1.0",
>> "configurationId":"ID found in the bucket notification 
>> configuration",
>> "bucket":{
>>"name":"bucket-name",
>>"ownerIdentity":{
>>   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>>},
>>"arn":"bucket-ARN"
>> },
>> "object":{
>>"key":"object-key",
>>"size":object-size,
>>"eTag":"object eTag",
>>"versionId":"object version if bucket is versioning-enabled, 
>> otherwise null",
>>"sequencer": "a string representation of a hexadecimal value 
>> used to determine event sequence,
>>only used with PUTs and DELETEs"
>> }
>>  }
>>   },
>>   {
>>   // Additional events
>>   }
>>]
>> }
>> 
>> Thanks
>> Ben
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
>> 
>> 
>> 
>> -- 
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>



Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
Thanks!

I got this to work.

val csvRdd = sc.parallelize(data.split("\n"))
val df = new 
com.databricks.spark.csv.CsvParser().withUseHeader(true).withInferSchema(true).csvRdd(sqlContext,
 csvRdd)

> On Apr 15, 2016, at 1:14 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> Hi,
> 
> Would you try this codes below?
> 
> val csvRDD = ...your processing for csv rdd..
> val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true)
> 
> Thanks!
> 
> On 16 Apr 2016 1:35 a.m., "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Hyukjin,
> 
> I saw that. I don’t know how to use it. I’m still learning Scala on my own. 
> Can you help me to start?
> 
> Thanks,
> Ben
> 
>> On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon <gurwls...@gmail.com 
>> <mailto:gurwls...@gmail.com>> wrote:
>> 
>> I hope it was not too late :).
>> 
>> It is possible.
>> 
>> Please check csvRdd api here, 
>> https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150
>>  
>> <https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150>.
>> 
>> Thanks!
>> 
>> On 2 Apr 2016 2:47 a.m., "Benjamin Kim" <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Does anyone know if this is possible? I have an RDD loaded with rows of CSV 
>> data strings. Each string representing the header row and multiple rows of 
>> data along with delimiters. I would like to feed each thru a CSV parser to 
>> convert the data into a dataframe and, ultimately, UPSERT a Hive/HBase table 
>> with this data.
>> 
>> Please let me know if you have any ideas.
>> 
>> Thanks,
>> Ben
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
> 



Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
Is this right?


import com.databricks.spark.csv

val csvRdd = data.flatMap(x => x.split("\n"))
val df = new CsvParser().csvRdd(sqlContext, csvRdd, useHeader = true)

Thanks,
Ben


> On Apr 15, 2016, at 1:14 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> Hi,
> 
> Would you try this codes below?
> 
> val csvRDD = ...your processing for csv rdd..
> val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true)
> 
> Thanks!
> 
> On 16 Apr 2016 1:35 a.m., "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Hyukjin,
> 
> I saw that. I don’t know how to use it. I’m still learning Scala on my own. 
> Can you help me to start?
> 
> Thanks,
> Ben
> 
>> On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon <gurwls...@gmail.com 
>> <mailto:gurwls...@gmail.com>> wrote:
>> 
>> I hope it was not too late :).
>> 
>> It is possible.
>> 
>> Please check csvRdd api here, 
>> https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150
>>  
>> <https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150>.
>> 
>> Thanks!
>> 
>> On 2 Apr 2016 2:47 a.m., "Benjamin Kim" <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Does anyone know if this is possible? I have an RDD loaded with rows of CSV 
>> data strings. Each string representing the header row and multiple rows of 
>> data along with delimiters. I would like to feed each thru a CSV parser to 
>> convert the data into a dataframe and, ultimately, UPSERT a Hive/HBase table 
>> with this data.
>> 
>> Please let me know if you have any ideas.
>> 
>> Thanks,
>> Ben
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> <mailto:user-h...@spark.apache.org>
>> 
> 



Re: JSON Usage

2016-04-15 Thread Benjamin Kim
Holden,

If I were to use DataSets, then I would essentially do this:

val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
for (message <- messages.asScala) {
val files = sqlContext.read.json(message.getBody())
}

Can I simply do files.toDS() or do I have to create a schema using a case class 
File and apply it as[File]? If I have to apply a schema, then how would I 
create it based on the JSON structure below, especially the nested elements.
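
One hedged way to get a typed Dataset without writing an encoder for the whole nested 
payload is to flatten the nested fields with explode/select first and then apply a 
small case class; the class and value names below are made up for illustration:

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

// Only the two fields needed downstream; the class name is illustrative.
case class S3File(bucket: String, key: String)

// read.json needs a path or an RDD[String], so wrap the raw body first
// (assumption: message.getBody() returns the JSON text of one S3 event notification).
val bodyRdd = sc.parallelize(message.getBody() :: Nil)

val files = sqlContext.read.json(bodyRdd)
  .select(explode($"Records").as("r"))            // one row per Record
  .select($"r.s3.bucket.name".as("bucket"),
          $"r.s3.object.key".as("key"))
  .as[S3File]                                     // typed Dataset[S3File]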

Thanks,
Ben


> On Apr 14, 2016, at 3:46 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
> 
> You could certainly use RDDs for that, you might also find using Dataset 
> selecting the fields you need to construct the URL to fetch and then using 
> the map function to be easier.
> 
> On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I was wonder what would be the best way to use JSON in Spark/Scala. I need to 
> lookup values of fields in a collection of records to form a URL and download 
> that file at that location. I was thinking an RDD would be perfect for this. 
> I just want to hear from others who might have more experience in this. Below 
> is the actual JSON structure that I am trying to use for the S3 bucket and 
> key values of each “record" within “Records".
> 
> {
>"Records":[
>   {
>  "eventVersion":"2.0",
>  "eventSource":"aws:s3",
>  "awsRegion":"us-east-1",
>  "eventTime":The time, in ISO-8601 format, for example, 
> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>  "eventName":"event-type",
>  "userIdentity":{
> 
> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>  },
>  "requestParameters":{
> "sourceIPAddress":"ip-address-where-request-came-from"
>  },
>  "responseElements":{
> "x-amz-request-id":"Amazon S3 generated request ID",
> "x-amz-id-2":"Amazon S3 host that processed the request"
>  },
>  "s3":{
> "s3SchemaVersion":"1.0",
> "configurationId":"ID found in the bucket notification 
> configuration",
> "bucket":{
>"name":"bucket-name",
>"ownerIdentity":{
>   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>},
>"arn":"bucket-ARN"
> },
> "object":{
>"key":"object-key",
>"size":object-size,
>"eTag":"object eTag",
>"versionId":"object version if bucket is versioning-enabled, 
> otherwise null",
>"sequencer": "a string representation of a hexadecimal value 
> used to determine event sequence,
>only used with PUTs and DELETEs"
> }
>  }
>   },
>   {
>   // Additional events
>   }
>]
> }
> 
> Thanks
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 
> 
> 
> -- 
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau <https://twitter.com/holdenkarau>


Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Benjamin Kim
Hi Hyukjin,

I saw that. I don’t know how to use it. I’m still learning Scala on my own. Can 
you help me to start?

Thanks,
Ben

> On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
> 
> I hope it was not too late :).
> 
> It is possible.
> 
> Please check csvRdd api here, 
> https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150
>  
> <https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150>.
> 
> Thanks!
> 
> On 2 Apr 2016 2:47 a.m., "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Does anyone know if this is possible? I have an RDD loaded with rows of CSV 
> data strings. Each string representing the header row and multiple rows of 
> data along with delimiters. I would like to feed each thru a CSV parser to 
> convert the data into a dataframe and, ultimately, UPSERT a Hive/HBase table 
> with this data.
> 
> Please let me know if you have any ideas.
> 
> Thanks,
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 



JSON Usage

2016-04-14 Thread Benjamin Kim
I was wondering what would be the best way to use JSON in Spark/Scala. I need to 
look up the values of fields in a collection of records to form a URL and download 
the file at that location. I was thinking an RDD would be perfect for this. I 
just want to hear from others who might have more experience with this. Below is 
the actual JSON structure that I am trying to use for the S3 bucket and key 
values of each “record" within “Records".

{  
   "Records":[  
  {  
 "eventVersion":"2.0",
 "eventSource":"aws:s3",
 "awsRegion":"us-east-1",
 "eventTime":The time, in ISO-8601 format, for example, 
1970-01-01T00:00:00.000Z, when S3 finished processing the request,
 "eventName":"event-type",
 "userIdentity":{  
"principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
 },
 "requestParameters":{  
"sourceIPAddress":"ip-address-where-request-came-from"
 },
 "responseElements":{  
"x-amz-request-id":"Amazon S3 generated request ID",
"x-amz-id-2":"Amazon S3 host that processed the request"
 },
 "s3":{  
"s3SchemaVersion":"1.0",
"configurationId":"ID found in the bucket notification 
configuration",
"bucket":{  
   "name":"bucket-name",
   "ownerIdentity":{  
  "principalId":"Amazon-customer-ID-of-the-bucket-owner"
   },
   "arn":"bucket-ARN"
},
"object":{  
   "key":"object-key",
   "size":object-size,
   "eTag":"object eTag",
   "versionId":"object version if bucket is versioning-enabled, 
otherwise null",
   "sequencer": "a string representation of a hexadecimal value 
used to determine event sequence, 
   only used with PUTs and DELETEs"
}
 }
  },
  {
  // Additional events
  }
   ]
}

Thanks
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Data Export

2016-04-14 Thread Benjamin Kim
Does anyone know when the exporting of data into CSV, TSV, etc. files will be 
available?

Thanks,
Ben

Re: Spark on Kudu

2016-04-13 Thread Benjamin Kim
Chris,

That would be great! And a first! I think everyone would take notice if KImpala 
had this.

Cheers,
Ben


> On Apr 13, 2016, at 8:22 AM, Chris George <christopher.geo...@rms.com> wrote:
> 
> SparkSQL cannot support these type of statements but we may be able to 
> implement similar functionality through the api.
> -Chris
> 
> On 4/12/16, 5:19 PM, "Benjamin Kim" <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
> It would be nice to adhere to the SQL:2003 standard for an “upsert” if it 
> were to be implemented.
> 
> MERGE INTO table_name USING table_reference ON (condition)
>  WHEN MATCHED THEN
>  UPDATE SET column1 = value1 [, column2 = value2 ...]
>  WHEN NOT MATCHED THEN
>  INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 …])
> 
> Cheers,
> Ben
> 
>> On Apr 11, 2016, at 12:21 PM, Chris George <christopher.geo...@rms.com 
>> <mailto:christopher.geo...@rms.com>> wrote:
>> 
>> I have a wip kuduRDD that I made a few months ago. I pushed it into gerrit 
>> if you want to take a look. http://gerrit.cloudera.org:8080/#/c/2754/ 
>> <http://gerrit.cloudera.org:8080/#/c/2754/>
>> It does pushdown predicates which the existing input formatter based rdd 
>> does not.
>> 
>> Within the next two weeks I’m planning to implement a datasource for spark 
>> that will have pushdown predicates and insertion/update functionality (need 
>> to look more at cassandra and the hbase datasource for best way to do this) 
>> I agree that server side upsert would be helpful.
>> Having a datasource would give us useful data frames and also make spark sql 
>> usable for kudu.
>> 
>> My reasoning for having a spark datasource and not using Impala is: 1. We 
>> have had trouble getting impala to run fast with high concurrency when 
>> compared to spark 2. We interact with datasources which do not integrate 
>> with impala. 3. We have custom sql query planners for extended sql 
>> functionality.
>> 
>> -Chris George
>> 
>> 
>> On 4/11/16, 12:22 PM, "Jean-Daniel Cryans" <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> 
>> You guys make a convincing point, although on the upsert side we'll need 
>> more support from the servers. Right now all you can do is an INSERT then, 
>> if you get a dup key, do an UPDATE. I guess we could at least add an API on 
>> the client side that would manage it, but it wouldn't be atomic.
>> 
>> J-D
>> 
>> On Mon, Apr 11, 2016 at 9:34 AM, Mark Hamstra <m...@clearstorydata.com 
>> <mailto:m...@clearstorydata.com>> wrote:
>> It's pretty simple, actually.  I need to support versioned datasets in a 
>> Spark SQL environment.  Instead of a hack on top of a Parquet data store, 
>> I'm hoping (among other reasons) to be able to use Kudu's write and 
>> timestamp-based read operations to support not only appending data, but also 
>> updating existing data, and even some schema migration.  The most typical 
>> use case is a dataset that is updated periodically (e.g., weekly or monthly) 
>> in which the the preliminary data in the previous window (week or month) is 
>> updated with values that are expected to remain unchanged from then on, and 
>> a new set of preliminary values for the current window need to be 
>> added/appended.
>> 
>> Using Kudu's Java API and developing additional functionality on top of what 
>> Kudu has to offer isn't too much to ask, but the ease of integration with 
>> Spark SQL will gate how quickly we would move to using Kudu and how 
>> seriously we'd look at alternatives before making that decision. 
>> 
>> On Mon, Apr 11, 2016 at 8:14 AM, Jean-Daniel Cryans <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> Mark,
>> 
>> Thanks for taking some time to reply in this thread, glad it caught the 
>> attention of other folks!
>> 
>> On Sun, Apr 10, 2016 at 12:33 PM, Mark Hamstra <m...@clearstorydata.com 
>> <mailto:m...@clearstorydata.com>> wrote:
>> Do they care being able to insert into Kudu with SparkSQL
>> 
>> I care about insert into Kudu with Spark SQL.  I'm currently delaying a 
>> refactoring of some Spark SQL-oriented insert functionality while trying to 
>> evaluate what to expect from Kudu.  Whether Kudu does a good job supporting 
>> inserts with Spark SQL will be a key consideration as to whether we adopt 
>> Kudu.
>> 
>> I'd like to know more about why SparkSQL inserts in necessary for you. Is it 
>> just that you currently 

Re: Spark on Kudu

2016-04-12 Thread Benjamin Kim
>
> FWIW the plan is to get to 1.0 in late Summer/early Fall. At Cloudera all our 
> resources are committed to making things happen in time, and a more fully 
> featured Spark integration isn't in our plans during that period. I'm really 
> hoping someone in the community will help with Spark, the same way we got a 
> big contribution for the Flume sink. 
> 
> J-D
> 
> On Sun, Apr 10, 2016 at 11:29 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since it’s 
> not “production-ready”, upper management doesn’t want to fully deploy it yet. 
> They just want to keep an eye on it though. Kudu was so much simpler and 
> easier to use in every aspect compared to HBase. Impala was great for the 
> report writers and analysts to experiment with for the short time it was up. 
> But, once again, the only blocker was the lack of Spark support for our Data 
> Developers/Scientists. So, production-level data population won’t happen 
> until then.
> 
> I hope this helps you get an idea where I am coming from…
> 
> Cheers,
> Ben
> 
> 
>> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> 
>> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> J-D,
>> 
>> The main thing I hear that Cassandra is being used as an updatable hot data 
>> store to ensure that duplicates are taken care of and idempotency is 
>> maintained. Whether data was directly retrieved from Cassandra for 
>> analytics, reports, or searches, it was not clear as to what was its main 
>> use. Some also just used it for a staging area to populate downstream tables 
>> in parquet format. The last thing I heard was that CQL was terrible, so that 
>> rules out much use of direct queries against it.
>> 
>> I'm no C* expert, but I don't think CQL is meant for real analytics, just 
>> ease of use instead of plainly using the APIs. Even then, Kudu should beat 
>> it easily on big scans. Same for HBase. We've done benchmarks against the 
>> latter, not the former.
>>  
>> 
>> As for our company, we have been looking for an updatable data store for a 
>> long time that can be quickly queried directly either using Spark SQL or 
>> Impala or some other SQL engine and still handle TB or PB of data without 
>> performance degradation and many configuration headaches. For now, we are 
>> using HBase to take on this role with Phoenix as a fast way to directly 
>> query the data. I can see Kudu as the best way to fill this gap easily, 
>> especially being the closest thing to other relational databases out there 
>> in familiarity for the many SQL analytics people in our company. The other 
>> alternative would be to go with AWS Redshift for the same reasons, but it 
>> would come at a cost, of course. If we went with either solutions, Kudu or 
>> Redshift, it would get rid of the need to extract from HBase to parquet 
>> tables or export to PostgreSQL to support more of the SQL language using by 
>> analysts or the reporting software we use..
>> 
>> Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have 
>> you folks tried Kudu with Impala yet with those use cases?
>>  
>> 
>> I hope this helps.
>> 
>> It does, thanks for nice reply.
>>  
>> 
>> Cheers,
>> Ben 
>> 
>>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>>> <mailto:jdcry...@apache.org>> wrote:
>>> 
>>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer to 
>>> "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues who 
>>> were also there did say that the hype around Spark isn't dying down.
>>> 
>>> There's definitely an overlap in the use cases that Cassandra, HBase, and 
>>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim 
>>> solution for the use case you describe.
>>> 
>>> Nothing significant happened in Kudu over the past month, it's a storage 
>>> engine so things move slowly *smile*. I'd love to see more contributions on 
>>> the Spark front. I know there's code out there that could be integrated in 
>>> kudu-spark, it just needs to land in gerrit. I'm sure folks will happily 
>>> review it.
>>> 
>>> Do you have relevant experiences you can share? I'd love to learn more 
>>> about the use cases for which you envision using Kud

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-12 Thread Benjamin Kim
All,

I have more of a general Scala JSON question.

I have set up a notification on the S3 source bucket that triggers a Lambda 
function to unzip each new file placed there. The function then saves the unzipped 
CSV file into another destination bucket, where a notification is sent to an SQS 
queue. The contents of the message body are JSON, with the top level being the 
“Records” collection containing one or more “Record” objects. I would like to 
know how to iterate through “Records”, retrieving each “Record” to extract the 
bucket value and the key value. I would then use this information to download 
the file into a DataFrame via spark-csv. Does anyone have any experience doing 
this?

I wrote a quick stab at it below, but I know it’s not right.

def functionToCreateContext(): StreamingContext = {
val ssc = new StreamingContext(sc, Seconds(30))   // new context
val records = ssc.receiverStream(new SQSReceiver("amg-events")
.credentials(accessKey, secretKey)
.at(Regions.US_EAST_1)
.withTimeout(2))

records.foreach(record => {
val bucket = record['s3']['bucket']['name']
val key = record['s3']['object']['key']
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("s3://" + bucket + "/" + key)
//save to hbase
})

ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
ssc
}
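
For the record, a rough sketch of how that body-parsing piece might look with 
foreachRDD (replacing the records.foreach block inside functionToCreateContext), 
reusing the nested-path select from the other JSON thread; the s3n scheme and the 
assumption that the receiver emits the raw JSON message body as a String are mine, 
not verified:

import org.apache.spark.sql.functions.explode
import sqlContext.implicits._

// records is the DStream created above from the SQS receiver.
records.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    // Each element is assumed to be one SQS message body (an S3 event notification).
    val events = sqlContext.read.json(rdd)
      .select(explode($"Records").as("r"))
      .select($"r.s3.bucket.name".as("bucket"), $"r.s3.object.key".as("key"))

    events.collect().foreach { row =>
      val bucket = row.getAs[String]("bucket")
      val key    = row.getAs[String]("key")
      val df = sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(s"s3n://$bucket/$key") // scheme is an assumption
      // save df to HBase here
    }
  }
}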

Thanks,
Ben

> On Apr 9, 2016, at 6:12 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> Ah, I spoke too soon.
> 
> I thought the SQS part was going to be a spark package. It looks like it has 
> be compiled into a jar for use. Am I right? Can someone help with this? I 
> tried to compile it using SBT, but I’m stuck with a SonatypeKeys not found 
> error.
> 
> If there’s an easier alternative, please let me know.
> 
> Thanks,
> Ben
> 
> 
>> On Apr 9, 2016, at 2:49 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>> This was easy!
>> 
>> I just created a notification on a source S3 bucket to kick off a Lambda 
>> function that would decompress the dropped file and save it to another S3 
>> bucket. In return, this S3 bucket has a notification to send a SNS message 
>> to me via email. I can just as easily setup SQS to be the endpoint of this 
>> notification. This would then convey to a listening Spark Streaming job the 
>> file information to download. I like this!
>> 
>> Cheers,
>> Ben 
>> 
>>> On Apr 9, 2016, at 9:54 AM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>> This is awesome! I have someplace to start from.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>>> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com 
>>>> <mailto:programminggee...@gmail.com> wrote:
>>>> 
>>>> Someone please correct me if I am wrong as I am still rather green to 
>>>> spark, however it appears that through the S3 notification mechanism 
>>>> described below, you can publish events to SQS and use SQS as a streaming 
>>>> source into spark. The project at 
>>>> https://github.com/imapi/spark-sqs-receiver 
>>>> <https://github.com/imapi/spark-sqs-receiver> appears to provide libraries 
>>>> for doing this.
>>>> 
>>>> Hope this helps.
>>>> 
>>>> Sent from my iPhone
>>>> 
>>>> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> 
>>>>> Nezih,
>>>>> 
>>>>> This looks like a good alternative to having the Spark Streaming job 
>>>>> check for new files on its own. Do you know if there is a way to have the 
>>>>> Spark Streaming job get notified with the new file information and act 
>>>>> upon it? This can reduce the overhead and cost of polling S3. Plus, I can 
>>>>> use this to notify and kick off Lambda to process new data files and make 
>>>>> them ready for Spark Streaming to consume. This will also use 
>>>>> notifications to trigger. I just need to have all incoming folders 
>>>>> configured for notifications for Lambda and all outgoing folders for 
>>>>> Spark Streaming. This sounds like a better set

Re: Spark on Kudu

2016-04-10 Thread Benjamin Kim
Yes, we took Kudu for a test run using 0.6 and 0.7 versions. But, since it’s 
not “production-ready”, upper management doesn’t want to fully deploy it yet. 
They just want to keep an eye on it though. Kudu was so much simpler and easier 
to use in every aspect compared to HBase. Impala was great for the report 
writers and analysts to experiment with for the short time it was up. But, once 
again, the only blocker was the lack of Spark support for our Data 
Developers/Scientists. So, production-level data population won’t happen until 
then.

I hope this helps you get an idea where I am coming from…

Cheers,
Ben

> On Apr 10, 2016, at 11:08 AM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
> 
> On Sun, Apr 10, 2016 at 12:30 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> J-D,
> 
> The main thing I hear that Cassandra is being used as an updatable hot data 
> store to ensure that duplicates are taken care of and idempotency is 
> maintained. Whether data was directly retrieved from Cassandra for analytics, 
> reports, or searches, it was not clear as to what was its main use. Some also 
> just used it for a staging area to populate downstream tables in parquet 
> format. The last thing I heard was that CQL was terrible, so that rules out 
> much use of direct queries against it.
> 
> I'm no C* expert, but I don't think CQL is meant for real analytics, just 
> ease of use instead of plainly using the APIs. Even then, Kudu should beat it 
> easily on big scans. Same for HBase. We've done benchmarks against the 
> latter, not the former.
>  
> 
> As for our company, we have been looking for an updatable data store for a 
> long time that can be quickly queried directly either using Spark SQL or 
> Impala or some other SQL engine and still handle TB or PB of data without 
> performance degradation and many configuration headaches. For now, we are 
> using HBase to take on this role with Phoenix as a fast way to directly query 
> the data. I can see Kudu as the best way to fill this gap easily, especially 
> being the closest thing to other relational databases out there in 
> familiarity for the many SQL analytics people in our company. The other 
> alternative would be to go with AWS Redshift for the same reasons, but it 
> would come at a cost, of course. If we went with either solutions, Kudu or 
> Redshift, it would get rid of the need to extract from HBase to parquet 
> tables or export to PostgreSQL to support more of the SQL language using by 
> analysts or the reporting software we use..
> 
> Ok, the usual then *smile*. Looks like we're not too far off with Kudu. Have 
> you folks tried Kudu with Impala yet with those use cases?
>  
> 
> I hope this helps.
> 
> It does, thanks for nice reply.
>  
> 
> Cheers,
> Ben 
> 
>> On Apr 9, 2016, at 2:00 PM, Jean-Daniel Cryans <jdcry...@apache.org 
>> <mailto:jdcry...@apache.org>> wrote:
>> 
>> Ha first time I'm hearing about SMACK. Inside Cloudera we like to refer to 
>> "Impala + Kudu" as Kimpala, but yeah it's not as sexy. My colleagues who 
>> were also there did say that the hype around Spark isn't dying down.
>> 
>> There's definitely an overlap in the use cases that Cassandra, HBase, and 
>> Kudu cater to. I wouldn't go as far as saying that C* is just an interim 
>> solution for the use case you describe.
>> 
>> Nothing significant happened in Kudu over the past month, it's a storage 
>> engine so things move slowly *smile*. I'd love to see more contributions on 
>> the Spark front. I know there's code out there that could be integrated in 
>> kudu-spark, it just needs to land in gerrit. I'm sure folks will happily 
>> review it.
>> 
>> Do you have relevant experiences you can share? I'd love to learn more about 
>> the use cases for which you envision using Kudu as a C* replacement.
>> 
>> Thanks,
>> 
>> J-D
>> 
>> On Fri, Apr 8, 2016 at 12:45 PM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> Hi J-D,
>> 
>> My colleagues recently came back from Strata in San Jose. They told me that 
>> everything was about Spark and there is a big buzz about the SMACK stack 
>> (Spark, Mesos, Akka, Cassandra, Kafka). I still think that Cassandra is just 
>> an interim solution as a low-latency, easily queried data store. I was 
>> wondering if anything significant happened in regards to Kudu, especially on 
>> the Spark front. Plus, can you come up with your own proposed stack acronym 
>> to promote?
>> 
>> Cheers,
>> Ben
>> 
>> 
>>> On Mar 1, 2016, at 12:20 PM, Je

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Ah, I spoke too soon.

I thought the SQS part was going to be a Spark package. It looks like it has to be 
compiled into a jar for use. Am I right? Can someone help with this? I tried to 
compile it using SBT, but I’m stuck with a “SonatypeKeys not found” error.

If there’s an easier alternative, please let me know.

Thanks,
Ben


> On Apr 9, 2016, at 2:49 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> This was easy!
> 
> I just created a notification on a source S3 bucket to kick off a Lambda 
> function that would decompress the dropped file and save it to another S3 
> bucket. In return, this S3 bucket has a notification to send a SNS message to 
> me via email. I can just as easily setup SQS to be the endpoint of this 
> notification. This would then convey to a listening Spark Streaming job the 
> file information to download. I like this!
> 
> Cheers,
> Ben 
> 
>> On Apr 9, 2016, at 9:54 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>> This is awesome! I have someplace to start from.
>> 
>> Thanks,
>> Ben
>> 
>> 
>>> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com 
>>> <mailto:programminggee...@gmail.com> wrote:
>>> 
>>> Someone please correct me if I am wrong as I am still rather green to 
>>> spark, however it appears that through the S3 notification mechanism 
>>> described below, you can publish events to SQS and use SQS as a streaming 
>>> source into spark. The project at 
>>> https://github.com/imapi/spark-sqs-receiver 
>>> <https://github.com/imapi/spark-sqs-receiver> appears to provide libraries 
>>> for doing this.
>>> 
>>> Hope this helps.
>>> 
>>> Sent from my iPhone
>>> 
>>> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> 
>>>> Nezih,
>>>> 
>>>> This looks like a good alternative to having the Spark Streaming job check 
>>>> for new files on its own. Do you know if there is a way to have the Spark 
>>>> Streaming job get notified with the new file information and act upon it? 
>>>> This can reduce the overhead and cost of polling S3. Plus, I can use this 
>>>> to notify and kick off Lambda to process new data files and make them 
>>>> ready for Spark Streaming to consume. This will also use notifications to 
>>>> trigger. I just need to have all incoming folders configured for 
>>>> notifications for Lambda and all outgoing folders for Spark Streaming. 
>>>> This sounds like a better setup than we have now.
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com 
>>>>> <mailto:nyigitb...@netflix.com>> wrote:
>>>>> 
>>>>> While it is doable in Spark, S3 also supports notifications: 
>>>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html 
>>>>> <http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html>
>>>>> 
>>>>> 
>>>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com 
>>>>> <mailto:nlaucha...@gmail.com>> wrote:
>>>>> Hi Benjamin,
>>>>> 
>>>>> I have done it . The critical configuration items are the ones below :
>>>>> 
>>>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", 
>>>>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
>>>>> AccessKeyId)
>>>>>   
>>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
>>>>> AWSSecretAccessKey)
>>>>> 
>>>>>   val inputS3Stream =  ssc.textFileStream("s3://example_bucket/folder 
>>>>> ")
>>>>> 
>>>>> This code will probe for new S3 files created in your every batch 
>>>>> interval.
>>>>> 
>>>>> Thanks,
>>>>> Natu
>>>>> 
>>>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com 
>>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>>> Has anyone monitored an S3 bucket or directory using Spark Streaming and 
>>>>> pulled any new files to process? If so, can you provide basic Scala 
>>>>> coding help on this?
>>>>> 
>>>>> Thanks,
>>>>> Ben
>>>>> 
>>>>> 
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>>>> <mailto:user-h...@spark.apache.org>
>>>>> 
>>>>> 
>>>> 
>> 
> 



Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This was easy!

I just created a notification on a source S3 bucket to kick off a Lambda 
function that would decompress the dropped file and save it to another S3 
bucket. In return, this S3 bucket has a notification to send a SNS message to 
me via email. I can just as easily setup SQS to be the endpoint of this 
notification. This would then convey to a listening Spark Streaming job the 
file information to download. I like this!

Cheers,
Ben 

> On Apr 9, 2016, at 9:54 AM, Benjamin Kim <bbuil...@gmail.com> wrote:
> 
> This is awesome! I have someplace to start from.
> 
> Thanks,
> Ben
> 
> 
>> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com 
>> <mailto:programminggee...@gmail.com> wrote:
>> 
>> Someone please correct me if I am wrong as I am still rather green to spark, 
>> however it appears that through the S3 notification mechanism described 
>> below, you can publish events to SQS and use SQS as a streaming source into 
>> spark. The project at https://github.com/imapi/spark-sqs-receiver 
>> <https://github.com/imapi/spark-sqs-receiver> appears to provide libraries 
>> for doing this.
>> 
>> Hope this helps.
>> 
>> Sent from my iPhone
>> 
>> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> 
>>> Nezih,
>>> 
>>> This looks like a good alternative to having the Spark Streaming job check 
>>> for new files on its own. Do you know if there is a way to have the Spark 
>>> Streaming job get notified with the new file information and act upon it? 
>>> This can reduce the overhead and cost of polling S3. Plus, I can use this 
>>> to notify and kick off Lambda to process new data files and make them ready 
>>> for Spark Streaming to consume. This will also use notifications to 
>>> trigger. I just need to have all incoming folders configured for 
>>> notifications for Lambda and all outgoing folders for Spark Streaming. This 
>>> sounds like a better setup than we have now.
>>> 
>>> Thanks,
>>> Ben
>>> 
>>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com 
>>>> <mailto:nyigitb...@netflix.com>> wrote:
>>>> 
>>>> While it is doable in Spark, S3 also supports notifications: 
>>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html 
>>>> <http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html>
>>>> 
>>>> 
>>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com 
>>>> <mailto:nlaucha...@gmail.com>> wrote:
>>>> Hi Benjamin,
>>>> 
>>>> I have done it . The critical configuration items are the ones below :
>>>> 
>>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", 
>>>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
>>>> AccessKeyId)
>>>>   
>>>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
>>>> AWSSecretAccessKey)
>>>> 
>>>>   val inputS3Stream =  ssc.textFileStream("s3://example_bucket/folder 
>>>> ")
>>>> 
>>>> This code will probe for new S3 files created in your every batch interval.
>>>> 
>>>> Thanks,
>>>> Natu
>>>> 
>>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com 
>>>> <mailto:bbuil...@gmail.com>> wrote:
>>>> Has anyone monitored an S3 bucket or directory using Spark Streaming and 
>>>> pulled any new files to process? If so, can you provide basic Scala coding 
>>>> help on this?
>>>> 
>>>> Thanks,
>>>> Ben
>>>> 
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>>> <mailto:user-unsubscr...@spark.apache.org>
>>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>>> <mailto:user-h...@spark.apache.org>
>>>> 
>>>> 
>>> 
> 



Re: Spark Plugin Information

2016-04-09 Thread Benjamin Kim
Josh,

For my tests, I’m passing the Zookeeper Quorum URL.

"zkUrl" -> 
"prod-nj3-hbase-master001.pnj3i.gradientx.com,prod-nj3-namenode001.pnj3i.gradientx.com,prod-nj3-namenode002.pnj3i.gradientx.com:2181”

Is this correct?

Thanks,
Ben


> On Apr 9, 2016, at 8:06 AM, Josh Mahonin <jmaho...@gmail.com> wrote:
> 
> Hi Ben,
> 
> It looks like a connection URL issue. Are you passing the correct 'zkUrl' 
> parameter, or do you have the HBase Zookeeper quorum defined in an 
> hbase-site.xml available in the classpath?
> 
> If you're able to connect to Phoenix using JDBC, you should be able to take 
> the JDBC url, pop off the 'jdbc:phoenix:' prefix and use it as the 'zkUrl' 
> option.
> 
> Josh
> 
> On Fri, Apr 8, 2016 at 6:47 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Josh,
> 
> I am using CDH 5.5.2 with HBase 1.0.0, Phoenix 4.5.2, and Spark 1.6.0. I 
> looked up the error and found others who led me to ask the question. I’ll try 
> to use Phoenix 4.7.0 client jar and see what happens.
> 
> The error I am getting is:
> 
> java.sql.SQLException: ERROR 103 (08004): Unable to establish connection.
>   at 
> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newException(SQLExceptionCode.java:388)
>   at 
> org.apache.phoenix.exception.SQLExceptionInfo.buildException(SQLExceptionInfo.java:145)
>   at 
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:296)
>   at 
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$300(ConnectionQueryServicesImpl.java:179)
>   at 
> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1917)
>   at 
> org.apache.phoenix.query.ConnectionQueryServicesImpl$12.call(ConnectionQueryServicesImpl.java:1896)
>   at 
> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixContextExecutor.java:77)
>   at 
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(ConnectionQueryServicesImpl.java:1896)
>   at 
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices(PhoenixDriver.java:180)
>   at 
> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.connect(PhoenixEmbeddedDriver.java:132)
>   at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:151)
>   at java.sql.DriverManager.getConnection(DriverManager.java:571)
>   at java.sql.DriverManager.getConnection(DriverManager.java:187)
>   at 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getConnection(ConnectionUtil.java:93)
>   at 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:57)
>   at 
> org.apache.phoenix.mapreduce.util.ConnectionUtil.getInputConnection(ConnectionUtil.java:45)
>   at 
> org.apache.phoenix.mapreduce.util.PhoenixConfigurationUtil.getSelectColumnMetadataList(PhoenixConfigurationUtil.java:280)
>   at org.apache.phoenix.spark.PhoenixRDD.toDataFrame(PhoenixRDD.scala:101)
>   at 
> org.apache.phoenix.spark.PhoenixRelation.schema(PhoenixRelation.scala:57)
>   at 
> org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1153)
>   at 
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:30)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:44)
>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:46)
>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:48)
>   at $iwC$$iwC$$iwC.<init>(<console>:50)
>   at $iwC$$iwC.<init>(<console>:52)
>   at $iwC.<init>(<console>:54)
>   at <init>(<console>:56)
>   at .<init>(<console>:60)
>   at .<clinit>(<console>)
>   at .<init>(<console>:7)
>   at .<clinit>(<console>)
>   at $print(<console>)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>   at 
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>   at 
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>   at 
> org.apache.spark

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
This is awesome! I have someplace to start from.

Thanks,
Ben


> On Apr 9, 2016, at 9:45 AM, programminggee...@gmail.com wrote:
> 
> Someone please correct me if I am wrong as I am still rather green to spark, 
> however it appears that through the S3 notification mechanism described 
> below, you can publish events to SQS and use SQS as a streaming source into 
> spark. The project at https://github.com/imapi/spark-sqs-receiver 
> <https://github.com/imapi/spark-sqs-receiver> appears to provide libraries 
> for doing this.
> 
> Hope this helps.
> 
> Sent from my iPhone
> 
> On Apr 9, 2016, at 9:55 AM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> 
>> Nezih,
>> 
>> This looks like a good alternative to having the Spark Streaming job check 
>> for new files on its own. Do you know if there is a way to have the Spark 
>> Streaming job get notified with the new file information and act upon it? 
>> This can reduce the overhead and cost of polling S3. Plus, I can use this to 
>> notify and kick off Lambda to process new data files and make them ready for 
>> Spark Streaming to consume. This will also use notifications to trigger. I 
>> just need to have all incoming folders configured for notifications for 
>> Lambda and all outgoing folders for Spark Streaming. This sounds like a 
>> better setup than we have now.
>> 
>> Thanks,
>> Ben
>> 
>>> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com 
>>> <mailto:nyigitb...@netflix.com>> wrote:
>>> 
>>> While it is doable in Spark, S3 also supports notifications: 
>>> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html 
>>> <http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html>
>>> 
>>> 
>>> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com 
>>> <mailto:nlaucha...@gmail.com>> wrote:
>>> Hi Benjamin,
>>> 
>>> I have done it . The critical configuration items are the ones below :
>>> 
>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", 
>>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
>>> AccessKeyId)
>>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
>>> AWSSecretAccessKey)
>>> 
>>>   val inputS3Stream =  ssc.textFileStream("s3://example_bucket/folder 
>>> ")
>>> 
>>> This code will probe for new S3 files created in your every batch interval.
>>> 
>>> Thanks,
>>> Natu
>>> 
>>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com 
>>> <mailto:bbuil...@gmail.com>> wrote:
>>> Has anyone monitored an S3 bucket or directory using Spark Streaming and 
>>> pulled any new files to process? If so, can you provide basic Scala coding 
>>> help on this?
>>> 
>>> Thanks,
>>> Ben
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>>> <mailto:user-unsubscr...@spark.apache.org>
>>> For additional commands, e-mail: user-h...@spark.apache.org 
>>> <mailto:user-h...@spark.apache.org>
>>> 
>>> 
>> 



Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Nezih,

This looks like a good alternative to having the Spark Streaming job check for 
new files on its own. Do you know if there is a way to have the Spark Streaming 
job get notified with the new file information and act upon it? This can reduce 
the overhead and cost of polling S3. Plus, I can use this to notify and kick 
off Lambda to process new data files and make them ready for Spark Streaming to 
consume. This will also use notifications to trigger. I just need to have all 
incoming folders configured for notifications for Lambda and all outgoing 
folders for Spark Streaming. This sounds like a better setup than we have now.

Thanks,
Ben

> On Apr 9, 2016, at 12:25 AM, Nezih Yigitbasi <nyigitb...@netflix.com> wrote:
> 
> While it is doable in Spark, S3 also supports notifications: 
> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html 
> <http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html>
> 
> 
> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com 
> <mailto:nlaucha...@gmail.com>> wrote:
> Hi Benjamin,
> 
> I have done it . The critical configuration items are the ones below :
> 
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
> AccessKeyId)
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
> AWSSecretAccessKey)
> 
>   val inputS3Stream =  ssc.textFileStream("s3://example_bucket/folder")
> 
> This code will probe for new S3 files created in your every batch interval.
> 
> Thanks,
> Natu
> 
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and 
> pulled any new files to process? If so, can you provide basic Scala coding 
> help on this?
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Benjamin Kim
Natu,

Do you know if textFileStream can see new files created anywhere underneath a 
whole bucket? For example, if the bucket name is incoming and the new files 
underneath it are 2016/04/09/00/00/01/data.csv and 
2016/04/09/00/00/02/data.csv, will these files be picked up? Also, will Spark 
Streaming avoid picking these files up again on the following run because it 
knows it has already processed them, or do we have to store state somewhere 
ourselves, like the last run date and time to compare against?
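
On the state question: within a run the file stream keeps track of which files it has 
already seen, and the usual way to carry that across restarts is checkpointing with 
StreamingContext.getOrCreate rather than hand-rolled bookkeeping. A minimal sketch 
(paths and batch interval are illustrative, and recovery only covers files inside the 
stream's remember window):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "s3n://example-bucket/checkpoints" // illustrative path

def createContext(): StreamingContext = {
  val newSsc = new StreamingContext(sc, Seconds(60))
  newSsc.checkpoint(checkpointDir)
  val incoming = newSsc.textFileStream("s3n://example-bucket/incoming") // illustrative path
  incoming.foreachRDD(rdd => rdd.take(5).foreach(println)) // placeholder processing
  newSsc
}

// On restart this recovers the stream's file-tracking metadata from the checkpoint
// instead of starting from scratch.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()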

Thanks,
Ben

> On Apr 8, 2016, at 9:15 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:
> 
> Hi Benjamin,
> 
> I have done it . The critical configuration items are the ones below :
> 
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", 
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", 
> AccessKeyId)
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", 
> AWSSecretAccessKey)
> 
>   val inputS3Stream =  ssc.textFileStream("s3://example_bucket/folder")
> 
> This code will probe for new S3 files created in your every batch interval.
> 
> Thanks,
> Natu
> 
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Has anyone monitored an S3 bucket or directory using Spark Streaming and 
> pulled any new files to process? If so, can you provide basic Scala coding 
> help on this?
> 
> Thanks,
> Ben
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Re: Spark Plugin Information

2016-04-08 Thread Benjamin Kim
$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at 
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at 
org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: java.lang.reflect.InvocationTargetException
at 
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:240)
at 
org.apache.hadoop.hbase.client.ConnectionManager.createConnection(ConnectionManager.java:414)
at 
org.apache.hadoop.hbase.client.ConnectionManager.createConnectionInternal(ConnectionManager.java:323)
at 
org.apache.hadoop.hbase.client.HConnectionManager.createConnection(HConnectionManager.java:144)
at 
org.apache.phoenix.query.HConnectionFactory$HConnectionFactoryImpl.createConnection(HConnectionFactory.java:47)
at 
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection(ConnectionQueryServicesImpl.java:294)
... 73 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection(ConnectionFactory.java:238)
... 78 more
Caused by: java.lang.UnsupportedOperationException: Unable to find 
org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at 
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:36)
at 
org.apache.hadoop.hbase.ipc.RpcControllerFactory.instantiate(RpcControllerFactory.java:58)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.createAsyncProcess(ConnectionManager.java:2317)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:688)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.<init>(ConnectionManager.java:630)
... 83 more
Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.hbase.ipc.controller.ClientRpcControllerFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:190)
at 
org.apache.hadoop.hbase.util.ReflectionUtils.instantiateWithCustomCtor(ReflectionUtils.java:32)
... 87 more

Thanks,
Ben


> On Apr 8, 2016, at 2:23 PM, Josh Mahonin <jmaho...@gmail.com> wrote:
> 
> Hi Ben,
> 
> If you have a reproducible test case, please file a JIRA for it. The 
> documentation (https://phoenix.apache.org/phoenix_spark.html 
> <https://phoenix.apache.org/phoenix_spark.html>) is accurate and verified for 
> up to Phoenix 4.7.0 and Spark 1.6.0.
> 
> Although not supported by the Phoenix project at large, you may find this 
> Docker image useful as a configuration reference:
> https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark 
> <https://github.com/jmahonin/docker-phoenix/tree/phoenix_spark>
> 
> Good luck!
> 
> Josh
> 
> On Fri, Apr 8, 2016 at 3:11 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbu

Re: Zeppelin Dashboards

2016-04-08 Thread Benjamin Kim
Ashish,

Quick question…

Does this include charts updating in near real-time?

Just wondering.

Cheers,
Ben


> On Apr 6, 2016, at 12:21 PM, moon soo Lee  wrote:
> 
> Hi Ashish,
> 
> Would tweaking looknfeel 
> https://github.com/apache/incubator-zeppelin/tree/master/zeppelin-web/src/assets/styles/looknfeel
>  
> 
>  helps in your use case?
> 
> Thanks,
> moon
> 
> On Tue, Apr 5, 2016 at 5:24 PM ashish rawat  > wrote:
> Thanks Johnny. I was able to get an initial approval on using Zeppelin in my 
> project, by using some of the things mentioned in those links :)
> 
> However, I still struggle a lot with it, in getting things right. For 
> example, if someone has to decrease the width of those input boxes and fit 
> more in a row, or increase the number of rows in the table. It's fine to have 
> a simple UI for your own ad-hoc analytics, but if you are making a dashboard 
> for customers, then it needs to be perfect.
> 
> Is there a good place to get started for tweaking these UI components? I 
> couldn't find much options in Zeppelin for Elastic Search, apart from elastic 
> search queries.
> 
> Regards,
> Ashish
> 
> 
> 
> On Sat, Apr 2, 2016 at 10:53 AM, Johnny W.  > wrote:
> Hello Ashish,
> 
> I am not sure about whether you how advanced dashboards you want to build, 
> but we had some great experience building simple dashboards using dynamic 
> forms and rest APIs. Hope these links can help you:
> --
> https://zeppelin.incubator.apache.org/docs/0.6.0-incubating-SNAPSHOT/manual/dynamicform.html
>  
> 
> https://zeppelin.incubator.apache.org/docs/0.6.0-incubating-SNAPSHOT/rest-api/rest-notebook.html
>  
> 
> 
> moon's example of programmatic graph configuration:
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL0xlZW1vb25zb28vemVwcGVsaW4tZXhhbXBsZXMvbWFzdGVyLzJCOFhRVUM1Qi9ub3RlLmpzb24
>  
> 
> 
> rest API example:
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL3pob25nbmV1L215LXplcHBlbGluLW5vdGVzL21hc3Rlci8yQkNBSjZHVkEvbm90ZS5qc29u
>  
> 
> 
> An example of dropdown menu/autocomplete:
> https://www.zeppelinhub.com/viewer/notebooks/aHR0cHM6Ly9yYXcuZ2l0aHVidXNlcmNvbnRlbnQuY29tL2Nvcm5lYWRvdWcvWmVwcGVsaW4tTm90ZWJvb2tzL21hc3Rlci9BdXRvLUNvbXBsZXRlLU11bHRpU2VsZWN0L25vdGUuanNvbg
>  
> 
> 
> 
> Thanks,
> Johnny
> 
> On Tue, Mar 29, 2016 at 11:50 PM, ashish rawat  > wrote:
> Thanks Moon. I do not have a good understanding of UI but really appreciate 
> the ease with which non-UI folks can get started with Zeppelin.
> 
> My end goal is to enable the customers to make a dashboard quickly. But I 
> don't want to branch off from Zeppelin and create a custom offering. So, can 
> I assume that these changes can be pushed back to Zeppelin and used by anyone 
> to create a new dashboard.
> 
> Regards,
> Ashish
> 
> On Thu, Mar 24, 2016 at 9:49 PM, moon soo Lee  > wrote:
> Hi Ashish,
> 
> Thanks for the question.
> I think you'll need to leverage Angular display system [1] to achieve the 
> features you want.
> 
> With Angular displays system, you can create date picker and update all 
> paragraphs on change. Same to Autocomplete search box.
> For linked graph (drilling down), i think you need to render your own graph 
> inside of notebook and handle event from the graph.
> 
> So it's pretty much possible but you'll need to write many codes inside of 
> notebook.
> If you have any good idea or suggestion, please feel free to share.
> 
> Thanks.
> moon
> 
> [1] 
> http://zeppelin.incubator.apache.org/docs/0.6.0-incubating-SNAPSHOT/displaysystem/angular.html
>  
> 
> 
> On Thu, Mar 24, 2016 at 6:48 AM ashish rawat  > wrote:
> Hi,
> 
> Can someone please help. Also, if these are not possible currently, then can 
> Zeppelin be extended to make this possible?
> 
> Regards,
> Ashish
> 
> On 

Re: Zeppelin UX Design Roadmap Proposal

2016-04-08 Thread Benjamin Kim
Hi Jeremy,

I was wondering if the code bar could be placed on the side too. Could it 
appear and disappear as well, and offer a quick view of the contents on 
rollover?

Otherwise, what you put together is great!

Cheers,
Ben

> On Apr 7, 2016, at 1:44 PM, Jeremy Anderson  
> wrote:
> 
> Hi all,
>  
> (sending from my consulting email, as my other email account keeps bouncing)
> 
> Some of you, I know and have met with in person. Others of you, I know by 
> name from these mailing lists. I'm a UX design lead at the Spark Technology 
> Center in San Francisco. I've been in conversation with a number of you about 
> UX design enhancements for the Zeppelin notebook. I've drafted an initial 
> proposal to start a UX roadmap for Zeppelin and begin a conversation. You can 
> find the proposal on Confluence at: 
> https://cwiki.apache.org/confluence/display/ZEPPELIN/Zeppelin+UX+Roadmap 
> 
>  
> I've outlined some initial themes and included a rough prototype video to 
> help capture the interaction (though it is a little rough). 
> 
> This roadmap is still a rough draft proposal. I will be adding to it moving 
> forward and encourage feedback. Looking forward to hearing from you.
>  
> Cheers,
>  
> Jeremy
> 
>  
> Jeremy Anderson
> UX Design Lead  |  Spark Technology Center  |  IBM
> 425 Market Street, 20th Floor  |  San Francisco, CA 94105
> spark.tc  | Linkedin 
> 


Spark Plugin Information

2016-04-08 Thread Benjamin Kim
I want to know if there is an update/patch coming to Spark or the Spark plugin. 
I see that the Spark plugin does not work because HBase classes are missing 
from the Spark assembly jar. So, when Spark does reflection, it does not look 
for HBase client classes in the Phoenix plugin jar but only in the Spark 
assembly jar. Is this true?

If someone can enlighten me on this topic, please let me know.
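
In the meantime, one workaround I have seen suggested is to put the Phoenix 
client jar on both the driver and executor classpaths so that the HBase and 
Phoenix classes are visible outside the Spark assembly, e.g. in 
spark-defaults.conf (the jar name and path below are only illustrative):

spark.driver.extraClassPath     /opt/phoenix/phoenix-4.7.0-client-spark.jar
spark.executor.extraClassPath   /opt/phoenix/phoenix-4.7.0-client-spark.jar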

Thanks,
Ben

Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Benjamin Kim
Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled 
any new files to process? If so, can you provide basic Scala coding help on 
this?

Thanks,
Ben


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




can spark-csv package accept strings instead of files?

2016-04-01 Thread Benjamin Kim
Does anyone know if this is possible? I have an RDD loaded with CSV data 
strings. Each string contains the header row and multiple rows of delimited 
data. I would like to feed each one through a CSV parser to convert the data 
into a DataFrame and, ultimately, UPSERT a Hive/HBase table with this data.

Please let me know if you have any ideas.

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Data Export

2016-04-01 Thread Benjamin Kim
I have not seen any movement on this for a while. Are there any updates? I like 
that Tableau export is included too.

Thanks,
Ben

> On Mar 16, 2016, at 12:50 AM, Khalid Huseynov <khalid...@nflabs.com> wrote:
> 
> This <https://github.com/apache/incubator-zeppelin/pull/761> seems to be the 
> most recent PR for that. 
> 
> On Wed, Mar 16, 2016 at 1:26 PM, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Any updates as to the progress of this issue?
> 
> 
>> On Feb 26, 2016, at 6:16 PM, Khalid Huseynov <khalid...@nflabs.com 
>> <mailto:khalid...@nflabs.com>> wrote:
>> 
>> As far as I know there're few PRs (#6 
>> <https://github.com/apache/incubator-zeppelin/pull/6>, #725 
>> <https://github.com/apache/incubator-zeppelin/pull/725>, #89 
>> <https://github.com/apache/incubator-zeppelin/pull/89>, #714 
>> <https://github.com/apache/incubator-zeppelin/pull/714>) addressing similar 
>> issue with some variation in approaches. They're being compared and probably 
>> some resolution should be reached. You can also take a look and express your 
>> opinion. Community may let us know if I'm missing something.
>> 
>> Best,
>> Khalid
>> 
>> On Sat, Feb 27, 2016 at 2:23 AM, Benjamin Kim <bbuil...@gmail.com 
>> <mailto:bbuil...@gmail.com>> wrote:
>> I don’t know if I’m missing something, but is there a way to export the 
>> result data into a CSV, Excel, etc. from a SQL statement?
>> 
>> Thanks,
>> Ben
>> 
>> 
> 
> 



Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
Hi Mich,

I forgot to mention that - and this is the ugly part - the source data provider 
gives us (Windows) pkzip-compressed files. Will Spark uncompress these 
automatically? I haven’t been able to make it work.

Thanks,
Ben

> On Mar 30, 2016, at 2:27 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi Ben,
> 
> Well I have done it for standard csv files downloaded from spreadsheets to 
> staging directory on hdfs and loaded from there.
> 
> First you may not need to unzip them. dartabricks can read them (in my case) 
> and zipped files.
> 
> Check this. Mine is slightly different from what you have, First I zip my csv 
> files with bzip2 and load them into hdfs
> 
> #!/bin/ksh
> DIR="/data/stg/accounts/nw/10124772"
> #
> ## Compress the files
> #
> echo `date` " ""===  Started compressing all csv FILEs"
> for FILE in `ls *.csv`
> do
>   /usr/bin/bzip2 ${FILE}
> done
> #
> ## Clear out hdfs staging directory
> #
> echo `date` " ""===  Started deleting old files from hdfs staging 
> directory ${DIR}"
> hdfs dfs -rm -r ${DIR}/*.bz2
> echo `date` " ""===  Started Putting bz2 fileS to hdfs staging directory 
> ${DIR}"
> for FILE in `ls *.bz2`
> do
>   hdfs dfs -copyFromLocal ${FILE} ${DIR}
> done
> echo `date` " ""===  Checking that all files are moved to hdfs staging 
> directory"
> hdfs dfs -ls ${DIR}
> exit 0
> 
> Now you have all your csv files in the staging directory
> 
> import org.apache.spark.sql.functions._
> import java.sql.{Date, Timestamp}
> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> println ("\nStarted at"); sqlContext.sql("SELECT 
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') 
> ").collect.foreach(println)
> 
> val df = 
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", 
> "true").option("header", 
> "true").load("hdfs://rhes564:9000/data/stg/accounts/nw/10124772")
> case class Accounts( TransactionDate: String, TransactionType: String, 
> Description: String, Value: Double, Balance: Double, AccountName: String, 
> AccountNumber : String)
> // Map the columns to names
> //
> val a = df.filter(col("Date") > "").map(p => 
> Accounts(p(0).toString,p(1).toString,p(2).toString,p(3).toString.toDouble,p(4).toString.toDouble,p(5).toString,p(6).toString))
> //
> // Create a Spark temporary table
> //
> a.toDF.registerTempTable("tmp")
> 
> // Need to create and populate target ORC table nw_10124772 in database 
> accounts.in <http://accounts.in/> Hive
> //
> sql("use accounts")
> //
> // Drop and create table nw_10124772
> //
> sql("DROP TABLE IF EXISTS accounts.nw_10124772")
> var sqltext : String = ""
> sqltext = """
> CREATE TABLE accounts.nw_10124772 (
> TransactionDateDATE
> ,TransactionType   String
> ,Description   String
> ,Value Double
> ,Balance   Double
> ,AccountName   String
> ,AccountNumber Int
> )
> COMMENT 'from csv file from excel sheet'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> """
> sql(sqltext)
> //
> // Put data in Hive table. Clean up is already done
> //
> sqltext = """
> INSERT INTO TABLE accounts.nw_10124772
> SELECT
>   
> TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(TransactionDate,'dd/MM/'),'-MM-dd'))
>  AS TransactionDate
> , TransactionType
> , Description
> , Value
> , Balance
> , AccountName
>     , AccountNumber
> FROM tmp
> """
> sql(sqltext)
> 
> println ("\nFinished at"); sqlContext.sql("SELECT 
> FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') ").collect.fore
> 
> Once you store into a some form of table (Parquet, ORC) etc you can do 
> whatever you like with it.
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 30 March 2016 at 22:13, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> Hi Mich,
> 
> You are correct. I am talking about the Databricks package spark-csv you ha

Re: Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
Hi Mich,

You are correct. I am talking about the Databricks package spark-csv you have 
below.

The files are stored in S3, and I download, unzip, and store each one as a 
string in a variable using the AWS SDK (aws-java-sdk-1.10.60.jar).

Here is some of the code.

val filesRdd = sc.parallelize(lFiles, 250)
filesRdd.foreachPartition(files => {
  val s3Client = new AmazonS3Client(new 
EnvironmentVariableCredentialsProvider())
  files.foreach(file => {
val s3Object = s3Client.getObject(new GetObjectRequest(s3Bucket, file))
val zipFile = new ZipInputStream(s3Object.getObjectContent())
val csvFile = readZipStream(zipFile)
  })
})

This function does the unzipping and converts to string.

def readZipStream(stream: ZipInputStream): String = {
  stream.getNextEntry
  var stuff = new ListBuffer[String]()
  val scanner = new Scanner(stream)
  while(scanner.hasNextLine){
stuff += scanner.nextLine
  }
  stuff.toList.mkString("\n")
}

The next step is to parse the CSV string and convert it to a DataFrame, which 
will populate a Hive/HBase table.
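
One approach that might work here is spark-csv's CsvParser, which, if I 
understand the recent 1.x releases correctly, can read from an RDD[String] 
instead of a file path. A rough sketch, assuming the unzipped content of one 
file has been brought back to the driver as a single string (csvString below), 
since the SQLContext can only be used on the driver:

import com.databricks.spark.csv.CsvParser

// Split the file content into lines and parse it with the same options
// spark-csv would use when reading from a path.
val lines = sc.parallelize(csvString.split("\n"))
val df = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, lines)

df.registerTempTable("incoming_csv")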

If you can help, I would be truly grateful.

Thanks,
Ben


> On Mar 30, 2016, at 2:06 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> just to clarify are you talking about databricks csv package.
> 
> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.3.0
> 
> Where are these zipped files? Are they copied to a staging directory in hdfs?
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 30 March 2016 at 15:17, Benjamin Kim <bbuil...@gmail.com 
> <mailto:bbuil...@gmail.com>> wrote:
> I have a quick question. I have downloaded multiple zipped files from S3 and 
> unzipped each one of them into strings. The next step is to parse using a CSV 
> parser. I want to know if there is a way to easily use the spark csv package 
> for this?
> 
> Thanks,
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> For additional commands, e-mail: user-h...@spark.apache.org 
> <mailto:user-h...@spark.apache.org>
> 
> 



Does Spark CSV accept a CSV String

2016-03-30 Thread Benjamin Kim
I have a quick question. I have downloaded multiple zipped files from S3 and 
unzipped each one of them into strings. The next step is to parse using a CSV 
parser. I want to know if there is a way to easily use the spark csv package 
for this?

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Importaing Hbase data

2016-03-25 Thread Benjamin Kim
The hbase-spark module is still a work in progress in terms of Spark SQL. All 
the RDD methods are complete and ready to use against the current version of 
HBase 1.0+, but the use of DataFrames will require the unreleased version of 
HBase 2.0. Fortunately, there is work in progress to back-port the hbase-spark 
module so that it does not have these deep-rooted dependencies on HBase 2.0 
(HBASE-14160). 
For more information on this, you can refer to 
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
 to see what they are trying to accomplish.
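
In the meantime, for plain RDD access against HBase 1.0+ without the 
hbase-spark module, the long-standing TableInputFormat route still works, and 
the result can be registered for Spark SQL. A minimal sketch (table, column 
family, and qualifier names are illustrative, and the mapping assumes the 
column is present on every row):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import sqlContext.implicits._

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")

// Each record is a (rowkey, Result) pair.
val hbaseRdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Pull out one column per row and expose it to %sql as a temporary table.
val pairs = hbaseRdd.map { case (key, result) =>
  (Bytes.toString(key.copyBytes()),
   Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"))))
}
pairs.toDF("rowkey", "value").registerTempTable("my_table_rdd")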

> On Mar 25, 2016, at 9:17 AM, Silvio Fiorito  
> wrote:
> 
> There’s also this, which seems more current: 
> https://github.com/apache/hbase/tree/master/hbase-spark 
> 
> 
> I haven’t used it, but I know Ted Malaska and others from Cloudera have 
> worked heavily on it.
> 
> From: Felix Cheung  >
> Reply-To: "users@zeppelin.incubator.apache.org 
> " 
>  >
> Date: Friday, March 25, 2016 at 12:01 PM
> To: "users@zeppelin.incubator.apache.org 
> " 
>  >, 
> "users@zeppelin.incubator.apache.org 
> " 
>  >
> Subject: Re: Importaing Hbase data
> 
> You should be able to access that from Spark SQL through a package like 
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase 
> 
> 
> This package seems like have not been updated for a while though.
> 
> 
> 
> On Tue, Mar 22, 2016 at 11:06 AM -0700, "Kumiko Yada"  > wrote:
> 
> Hello,
>  
> Is there a way to importing Hbase data to the Zeppelin notebook using the 
> Spark SQL?
>  
> Thanks
> Kumiko



BinaryFiles to ZipInputStream

2016-03-23 Thread Benjamin Kim
I need a little help. I am loading zipped CSV files stored in S3 into Spark 1.6.

First of all, I am able to get the List of file keys that have a modified date 
within a range of time by using the AWS SDK objects (AmazonS3Client, 
ObjectListing, S3ObjectSummary, ListObjectsRequest, GetObjectRequest). Then, by 
setting up the Hadoop Configuration object with S3 access and secret keys, I 
parallelize, partition, and iterate through the List to load each file’s 
contents into an RDD[(String, org.apache.spark.input.PortableDataStream)].

val hadoopConf = sc.hadoopConfiguration;
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", accessKey)
hadoopConf.set("fs.s3.awsSecretAccessKey", secretKey)

val filesRdd = sc.parallelize(lFiles)
filesRdd.foreachPartition(files => {
  val lZipFiles = files.map(x => sc.binaryFiles("s3://" + s3Bucket + "/" + x))
  val lZipStream = lZipFiles.map(x => new ZipInputStream(x)) // make them all 
zip input streams
  val lStrContent = lZipStream.map(x => readZipStream(x))  // read contents 
into string 

})

This is where I need help. I get this error.

:196: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(String, 
org.apache.spark.input.PortableDataStream)]
 required: java.io.InputStream
val lZipStream = lZipFiles.map(x => new ZipInputStream(x)) // 
make them all zip input streams

   ^

Does anyone know how to load the PortableDataStream returned in an RDD and 
convert it into a ZipInputStream?
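
One direction that might work is to build the RDD once on the driver, since the 
SparkContext cannot be used inside foreachPartition; binaryFiles accepts a 
comma-separated list of paths (like textFile), and PortableDataStream.open() 
returns a DataInputStream that ZipInputStream can wrap. A rough sketch using 
the same readZipStream helper:

import java.util.zip.ZipInputStream

// Build one comma-separated path string from the selected keys.
val paths = lFiles.map(f => "s3://" + s3Bucket + "/" + f).mkString(",")

// (path, PortableDataStream) pairs, one per zipped file, in 250 partitions.
val zipped = sc.binaryFiles(paths, 250)

val lStrContent = zipped.map { case (path, pds) =>
  val zis = new ZipInputStream(pds.open())  // open() returns a DataInputStream
  try readZipStream(zis) finally zis.close()
}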

Thanks,
Ben
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: new object store driver for Spark

2016-03-22 Thread Benjamin Kim
Hi Gil,

Currently, our company uses S3 heavily for data storage. Can you further 
explain the benefits of this connector in relation to S3 once the pending patch 
comes out? Also, I have heard of Swift from others. Can you explain the pros 
and cons of Swift compared to HDFS? A brief summary is fine, or just point me 
to material that will help me get a better understanding.

Thanks,
Ben

> On Mar 22, 2016, at 6:35 AM, Gil Vernik  wrote:
> 
> We recently released an object store connector for Spark. 
> https://github.com/SparkTC/stocator 
> Currently this connector contains driver for the Swift based object store ( 
> like SoftLayer or any other Swift cluster ), but it can easily support 
> additional object stores.
> There is a pending patch to support Amazon S3 object store. 
> 
> The major highlight is that this connector doesn't create any temporary files 
>  and so it achieves very fast response times when Spark persist data in the 
> object store.
> The new connector supports speculate mode and covers various failure 
> scenarios ( like two Spark tasks writing into same object, partial corrupted 
> data due to run time exceptions in Spark master, etc ).  It also covers 
> https://issues.apache.org/jira/browse/SPARK-10063 
> and other known issues.
> 
> The detail algorithm for fault tolerance will be released very soon. For now, 
> those who interested, can view the implementation in the code itself.
> 
>  https://github.com/SparkTC/stocator 
> contains all the details how to setup 
> and use with Spark.
> 
> A series of tests showed that the new connector obtains 70% improvements for 
> write operations from Spark to Swift and about 30% improvements for read 
> operations from Swift into Spark ( comparing to the existing driver that 
> Spark uses to integrate with objects stored in Swift). 
> 
> There is an ongoing work to add more coverage and fix some known bugs / 
> limitations.
> 
> All the best
> Gil
> 


