Re: Hbase in spark

2016-02-26 Thread Ted Malaska
Yes, and I have used HBASE-15271 and successfully loaded over 20 billion records into HBase even with node failures. On Fri, Feb 26, 2016 at 11:55 AM, Ted Yu wrote: > In hbase, there is hbase-spark module which supports bulk load. > This module is to be backported in the upcoming 1.3.0 release. >

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
a/examples/spark/SparkWordCount.java > > Lastly, here is GoraSparkEngine: > > > https://github.com/kamaci/gora/blob/master/gora-core/src/main/java/org/apache/gora/spark/GoraSparkEngine.java > > Kind Regards, > Furkan KAMACI > > On Wed, Aug 26, 2015 at 4:40 PM, Ted Malaska

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
ra-hbase/src/test/java/org/apache/gora/hbase/mapreduce/TestHBaseStoreWordCount.java#L65 > > I've implemented a Spark backend for Apache Gora as GSoC project and this > is the latest obstacle that I should solve. If you can help me, you are > welcome. > > Kind Regards, > Fu

Re: Spark Cannot Connect to HBaseClusterSingleton

2015-08-26 Thread Ted Malaska
I've always used HBaseTestingUtility and never really had much trouble. I use that for all my unit testing between Spark and HBase. Here are some code examples if you're interested --Main HBase-Spark Module https://github.com/apache/hbase/tree/master/hbase-spark --Unit test that cover all basic co

Re: Reply: Reply: Reply: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-13 Thread Ted Malaska
d > use Gets as much as possible. > > > > Thanks, > > > > > > *From:* Ted Malaska [mailto:ted.mala...@cloudera.com] > *Sent:* August 12, 2015 9:14 > *To:* Yan Zhou.sc > *Cc:* dev@spark.apache.org; Bing Xiao (Bing); Ted Yu; user > *Subject:* RE: Reply: Reply: Package Rel

RE: Reply: Reply: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
the past email. U will note in 14181 that the filter push will also limit the scan range or drop the scan altogether for gets. Ted Malaska On Aug 11, 2015 9:06 PM, "Yan Zhou.sc" wrote: > No, Astro bulkloader does not use its own shuffle. But map/reduce-side > processing is somewha

RE: Reply: Reply: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
through a subset of > DataFrame functionalities like filter, projection, and other map-side ops, > would it be feasible to decouple it from Spark? > > My understanding is that 14181 does not run Spark execution engine at all, > but will make use of Spark Dataframe semantic an

RE: Reply: Reply: Package Release Annoucement: Spark SQL on HBase "Astro"

2015-08-11 Thread Ted Malaska
user to do anything they did with MR/HBase now with Spark/Hbase. Things like bulk load. Let me know if u have any questions Ted Malaska On Aug 11, 2015 7:13 PM, "Yan Zhou.sc" wrote: > We have not “formally” published any numbers yet. A good reference is a > slide deck we pos

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
On Tue, Jul 28, 2015 at 12:23 PM, Ted Malaska wrote: > Stuff that people are using is here. > > https://github.com/cloudera-labs/SparkOnHBase > > The stuff going into HBase is here > https://issues.apache.org/jira/browse/HBASE-13992 > > If you want to add things to the hb

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
Functions 1. BulkPut 2. BulkGet 3. BulkDelete 4. Foreach with connection 5. Map with connection 6. Distributed Scan 7. BulkLoad DataFrame Functions 1. BulkPut 2. BulkGet 6. Distributed Scan 7. BulkLoad If you think there should be more let me know Ted Malaska On Tue, Jul 28, 2015 at 12:17 PM

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
nd > then mutating them after series of iterative (bsp-like) steps. > > On 28 July 2015 at 17:06, Ted Malaska wrote: > >> Thanks Michal, >> >> Just to share what I'm working on in a related topic. So a long time ago >> I built SparkOnHBase and put it into Cloud

Re: Generalised Spark-HBase integration

2015-07-28 Thread Ted Malaska
ple blog I also put together http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/ Let me know if you have any questions, also let me know if you want to connect to join efforts. Ted Malaska On Tue, Jul 28, 2015 at 11:59 AM, Michal Ha

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I added the following jira https://issues.apache.org/jira/browse/SPARK-9237 Please help me get it assigned to me. Thanks. Ted Malaska On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska wrote: > Cool I will make a jira after I check in to my hotel. And try to get a > patch early next week

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
ced some strange > behaviour when testing it on small datasets. > > 2015-07-21 20:30 GMT+02:00 Ted Malaska : > >> Look at the implementation for frequent items. It is different from a >> true count. >> On Jul 21, 2015 1:19 PM, "Reynold Xin" wrote: >> &

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
tatFunctions.scala#L97 > > > > On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska > wrote: > >> 100% I would love to do it. Who is a good person to review the design >> with? All I need is a quick chat about the design and approach and I'll >> create the jira and push a

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
100% I would love to do it. Who is a good person to review the design with? All I need is a quick chat about the design and approach and I'll create the jira and push a patch. Ted Malaska On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot < o.girar...@lateral-thoughts.com> wrote: >

Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I'm guessing you want something like what I put in this blog post. http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1 I would love to add it to dataframes. Let me know Ted Malaska On Tue, J

Re: Welcoming some new committers

2015-06-20 Thread Ted Malaska
Super congrats. Well earned. On Jun 20, 2015 12:48 PM, "Andrew Or" wrote: > Welcome! > > 2015-06-20 7:30 GMT-07:00 Debasish Das : > >> Congratulations to All. >> >> DB great work in bringing quasi newton methods to Spark ! >> >> On Wed, Jun 17, 2015 at 3:18 PM, Chester Chen >> wrote: >> >>> Con

Re: Spark-Submit issues

2014-11-12 Thread Ted Malaska
/hbase/lib/* SparkHBase.jar t1 c On Wed, Nov 12, 2014 at 4:25 PM, Hari Shreedharan wrote: > Yep, you’d need to shade jars to ensure all your dependencies are in the > classpath. > > Thanks, > Hari > > > On Wed, Nov 12, 2014 at 3:23 AM, Ted Malaska > wrote: > >>

Re: Spark-Submit issues

2014-11-12 Thread Ted Malaska
Hey, this is Ted. Are you using Shade when you build your jar, and are you using the bigger jar? Looks like classes are not included in your jar. On Wed, Nov 12, 2014 at 2:09 AM, Jeniba Johnson < jeniba.john...@lntinfotech.com> wrote: > Hi Hari, > > Now I am trying out the same FlumeEventCount examp

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Ted Malaska
This is a big deal, great job. On Fri, Oct 10, 2014 at 11:19 AM, Mridul Muralidharan wrote: > Brilliant stuff ! Congrats all :-) > This is indeed really heartening news ! > > Regards, > Mridul > > > On Fri, Oct 10, 2014 at 8:24 PM, Matei Zaharia > wrote: > > Hi folks, > > > > I interrupt your r

Re: Compile error when compiling for cloudera

2014-07-17 Thread Ted Malaska
Don't make this change yet. I have a 1642 that needs to get through around the same code. I can make this change after 1642 is through. On Thu, Jul 17, 2014 at 12:25 PM, Sean Owen wrote: > CC tmalaska since he touched the line in question. This is a fun one. > So, here's the line of code adde