Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Xi Shen
Hi Chitturi, Please check out https://spark.apache.org/docs/1.0.1/api/java/org/apache/spark/mllib/clustering/KMeans.html#setInitializationSteps(int). I think it is caused by the initialization step. The "kmeans||" method does not initialize the dataset in parallel. If your dataset is large, it takes
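[Editor's note] For reference, a minimal sketch of switching the initialization mode away from "kmeans||" (the RDD name "data", the k value, and the iteration count are placeholders, not from this thread):

import org.apache.spark.mllib.clustering.KMeans

// assumes `data` is an already-loaded and cached RDD[Vector]
val model = new KMeans()
  .setK(60)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.RANDOM) // skip the multi-pass k-means|| seeding
  .run(data)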

Re: Why KMeans with mllib is so slow ?

2016-03-12 Thread Chitturi Padma
Hi All, I am facing the same issue. Taking k values from 60 to 120, incrementing by 10 each time (i.e. k takes the values 60, 70, 80, ..., 120), the algorithm takes around 2.5 hours on an 800 MB data set with 38 dimensions. On Sun, Mar 29, 2015 at 9:34 AM, davidshen84 [via Apache Spark User List] <
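[Editor's note] For reference, a minimal sketch of such a k sweep (assuming "data" is an RDD[Vector]; caching matters because every k value re-scans the full dataset):

import org.apache.spark.mllib.clustering.KMeans

data.cache() // avoid re-reading the 800 MB input for every k
val costs = (60 to 120 by 10).map { k =>
  val model = KMeans.train(data, k, 20) // (data, k, maxIterations)
  k -> model.computeCost(data)          // within-cluster sum of squared errors
}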

Re: forgetfulness in clustering algorithm

2016-03-12 Thread Chitturi Padma
Hi, I am interested in the Streaming k-means algorithm and the parameter forgetfulness. Could someone please throw some light on this? On Wed, Jul 29, 2015 at 11:23 AM, AmmarYasir [via Apache Spark User List] < ml-node+s1001560n24050...@n3.nabble.com> wrote: > > I read the post regarding
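[Editor's note] For anyone finding this later, a minimal sketch of the forgetfulness knobs on StreamingKMeans (the k, dimension, and decay values are placeholders):

import org.apache.spark.mllib.clustering.StreamingKMeans

val model = new StreamingKMeans()
  .setK(3)
  .setDecayFactor(0.5)           // 1.0 = remember all history, 0.0 = use only the latest batch
  // .setHalfLife(10, "batches") // equivalent knob: old data halves in weight every 10 batches
  .setRandomCenters(dim = 2, weight = 0.0)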

Re: How to efficiently query a large table with multiple dimensional table?

2016-03-12 Thread ashokkumar rajendran
Any input on this? Does it have something to do with the SQL engine parser / optimizer? Please help. Regards Ashok On Fri, Mar 11, 2016 at 3:22 PM, ashokkumar rajendran < ashokkumar.rajend...@gmail.com> wrote: > Hi All, > > I have a large table with a few billion rows and a very small table
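[Editor's note] One thing worth trying, sketched here with placeholder names (bigDF, smallDF, dim_key): broadcasting the small dimension table lets the planner use a broadcast hash join instead of shuffling the billion-row table. Tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default) are broadcast automatically.

import org.apache.spark.sql.functions.broadcast

// force a map-side join: ship the small dimension table to every executor
val joined = bigDF.join(broadcast(smallDF), "dim_key")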

Re: spark 1.6.0 connect to hive metastore

2016-03-12 Thread Timur Shenkao
I had a similar issue with CDH 5.5.3, not only with Spark 1.6 but with beeline as well. I resolved it by installing & running a hiveserver2 role instance on the same server where the metastore is. On Tue, Feb 9, 2016 at 10:58 PM, Koert Kuipers
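[Editor's note] For completeness, a sketch of pointing Spark at a remote metastore explicitly (the thrift host and port are placeholders); the usual alternative is placing hive-site.xml in Spark's conf directory:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")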

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Mich Talebzadeh
Certainly the only graphs that I can produce are from SQL queries on base tables. That basically means that the data has to be stored in permanent tables, so temporary tables in Spark cannot be used (?). Additionally, it seems to work with SQL only (and I have not seen any presentation using
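[Editor's note] On the temp-table question: a table registered from a Scala paragraph is visible to %sql paragraphs in the same Zeppelin session, so permanent tables are not strictly required. A sketch (the DataFrame "df" and the table name are placeholders):

df.registerTempTable("events")
// then in a separate Zeppelin paragraph: %sql select count(*) from events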

Spark Streaming stateful transformation mapWithState function getting error scala.MatchError: [Ljava.lang.Object]

2016-03-12 Thread Vinti Maheshwari
Hi All, I wanted to replace my updateStateByKey function with the mapWithState function (Spark 1.6) to improve the performance of my program. I was following these two documents: https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html
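[Editor's note] For reference, a minimal sketch of the mapWithState pattern from those documents (the (String, Int) key/value types and the running-sum logic are placeholders, not the poster's code):

import org.apache.spark.streaming.{State, StateSpec}

def trackState(key: String, value: Option[Int], state: State[Int]): (String, Int) = {
  val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum) // persist the running total for this key
  (key, sum)        // emitted downstream for this batch
}

// pairs is a DStream[(String, Int)]; mapWithState also requires ssc.checkpoint(...)
val stateStream = pairs.mapWithState(StateSpec.function(trackState _))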

Too Many Accumulators in my Spark Job

2016-03-12 Thread Harshvardhan Chauhan
Hi, My question is about having a lot of counters in Spark to keep track of bad/null values in my RDD. It is described in detail in the StackOverflow link below: http://stackoverflow.com/questions/35953400/too-many-accumulators-in-spark-job Posting to the user group to get more traction. Appreciate your
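[Editor's note] One possible pattern, sketched under the Spark 1.x API (the counter names and the RDD are placeholders): fold many counters into a single accumulator over a Map with a custom AccumulatorParam.

import org.apache.spark.AccumulatorParam

object CountersParam extends AccumulatorParam[Map[String, Long]] {
  def zero(init: Map[String, Long]) = Map.empty[String, Long]
  def addInPlace(a: Map[String, Long], b: Map[String, Long]) =
    b.foldLeft(a) { case (m, (k, v)) => m + (k -> (m.getOrElse(k, 0L) + v)) }
}

val counters = sc.accumulator(Map.empty[String, Long])(CountersParam)
rdd.foreach { row =>
  if (row == null) counters += Map("nullRow" -> 1L) // one logical counter per key
}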

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-12 Thread Mich Talebzadeh
Hi, Thanks for the input. I use Hive 2 and still have this issue:
1. Hive version 2
2. Hive on Spark engine 1.3.1
3. Spark 1.5.2
I have added the Hive user group to this as well, so hopefully we may get some resolution. HTH Dr Mich Talebzadeh LinkedIn *

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-12 Thread Timur Shenkao
Hi, I have suffered from Hive Streaming and transactions enough, so I can share my experience with you. 1) It's not a problem with Spark. It happens because of "peculiarities" / bugs in Hive Streaming. Hive Streaming and transactions are very raw technologies. If you look at the Hive JIRA, you'll see

spark-submit returns nothing with spark 1.6

2016-03-12 Thread Emmanuel
Hello, When I used to submit a job with Spark 1.4, it would return a job ID and a status (RUNNING, FAILED or something like this). I just upgraded to 1.6 and there is no status returned by spark-submit. Is there a way to get this information back? When I submit a job I want to know which one it

Re: Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-12 Thread Mich Talebzadeh
This is an interesting one as it appears that a hive transactional table
1. Hive version 2
2. Hive on Spark engine 1.3.1
3. Spark 1.5.2
hive> create table default.foo(id int) clustered by (id) into 2 buckets STORED AS ORC TBLPROPERTIES ('transactional'='true');
hive> insert into

Problem running JavaDirectKafkaWordCount

2016-03-12 Thread Martin Andreoni
Hi, I'm starting with Spark and I'm having some issues. When I run the 'jar' example of JavaDirectKafkaWordCount it works perfectly. However, if I compile the code myself and submit it, I get the following error: > ERROR ActorSystemImpl: Uncaught fatal error from thread >

Re: NullPointerException

2016-03-12 Thread saurabh guru
I don't see how that would be possible. I am reading from a live stream of data through Kafka. On Sat 12 Mar, 2016 20:28 Ted Yu, wrote: > Interesting. > If kv._1 was null, shouldn't the NPE have come from getPartition() (line > 105)? > > Was it possible that records.next()

Re: NullPointerException

2016-03-12 Thread Ted Yu
Interesting. If kv._1 was null, shouldn't the NPE have come from getPartition() (line 105)? Was it possible that records.next() returned null? On Fri, Mar 11, 2016 at 11:20 PM, Prabhu Joseph wrote: > Looking at ExternalSorter.scala line 192, I suspect some input
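[Editor's note] A defensive sketch for testing both hypotheses (assuming "stream" is the Kafka DStream of key/value pairs from this thread): drop null keys and values before the shuffle and see whether the NPE disappears.

val cleaned = stream.filter { case (k, v) => k != null && v != null }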

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
I'm pretty new to all of this stuff, so bear with me. Zeppelin isn't really intended for realtime dashboards as far as I know. Its reporting features (tables, graphs, etc.) are more for displaying the results from the output of something. As far as I know, there isn't really anything to "watch" a

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread trung kien
Thanks Chris and Mich for replying. Sorry for not explaining my problem clearly. Yes, I am talking about a flexible dashboard when I mention Zeppelin. Here is the problem I am having: I am running a commercial website where we sell many products and we have many branches in many places. We have a

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Also adding the dev list in case anyone else has ideas / views. On Sat, 12 Mar 2016 at 12:52, Nick Pentreath wrote: > Thanks for the feedback. > > I think Spark can certainly meet your use case when your data size scales > up, as the actual model dimension is very small -

Re: Spark ML - Scaling logistic regression for many features

2016-03-12 Thread Nick Pentreath
Thanks for the feedback. I think Spark can certainly meet your use case when your data size scales up, as the actual model dimension is very small - you will need to use those indexers or some other mapping mechanism. There is ongoing work for Spark 2.0 to make it easier to use models outside of
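[Editor's note] For concreteness, a sketch of the indexer/encoder route into logistic regression (the column names and the training DataFrame are placeholders):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
val encoder   = new OneHotEncoder().setInputCol("categoryIdx").setOutputCol("categoryVec")
val assembler = new VectorAssembler().setInputCols(Array("categoryVec")).setOutputCol("features")
val lr        = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

// assumes trainingDF has "category" and "label" columns
val model = new Pipeline().setStages(Array(indexer, encoder, assembler, lr)).fit(trainingDF)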

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
Well, I kind of got it... this works below:

val rdd = sc.newAPIHadoopFile(path, classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]], classOf[NullWritable])
rdd.map(item => {
  val copied = item.copy()
  val record = copied._1.datum()
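[Editor's note] A cleaner variant of the same idea (a sketch, assuming the repeats come from Hadoop's per-partition object reuse): deep-copy each Avro datum as it is read, before anything buffers the records.

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val records = sc.newAPIHadoopFile(path,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  .map { case (key, _) =>
    val datum = key.datum()
    GenericData.get().deepCopy(datum.getSchema, datum) // fresh record per row
  }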

Spark SQL is not returning records for HIVE transactional tables on HDP

2016-03-12 Thread @Sanjiv Singh
Hi All, I am facing this issue on an HDP setup, on which COMPACTION is required (only once) for transactional tables before Spark SQL can fetch records. On the other hand, an Apache setup doesn't require compaction even once. Maybe something got triggered on the meta-store after compaction, and Spark SQL start

Re: Repeating Records w/ Spark + Avro?

2016-03-12 Thread Chris Miller
Wow! That sure is buried in the documentation! But yeah, that's what I thought more or less. I tried copying as follows, but that didn't work:

val copyRDD = singleFileRDD.map(_.copy())

When I iterate over the new copyRDD (foreach or map), I still have the

Re: Correct way to use spark streaming with apache zeppelin

2016-03-12 Thread Chris Miller
What exactly are you trying to do? Zeppelin is for interactive analysis of a dataset. What do you mean by "realtime analytics" -- do you mean building a report or dashboard that automatically updates as new data comes in? -- Chris Miller On Sat, Mar 12, 2016 at 3:13 PM, trung kien