Re: Indexing Support

2015-10-18 Thread Russ Weeks
Distributed R-Trees are not very common. Most "big data" spatial solutions collapse multi-dimensional data into a distributed one-dimensional index using a space-filling curve. Many implementations exist outside of Spark, e.g. for HBase or Accumulo. It's simple enough to write a map function that
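
A minimal sketch of that idea, assuming 2-D points with non-negative integer grid coordinates: interleaving the bits of the two coordinates (a Z-order/Morton code) yields a one-dimensional key that can be sorted or range-partitioned, roughly the "map function" mentioned above. The encoding below is illustrative, not from the original post.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ZOrderExample {
  // Interleave the low 16 bits of x and y into a single Morton code,
  // so nearby 2-D points tend to get nearby 1-D keys.
  def mortonCode(x: Int, y: Int): Long = {
    var code = 0L
    var i = 0
    while (i < 16) {
      code |= ((x >> i) & 1L) << (2 * i)
      code |= ((y >> i) & 1L) << (2 * i + 1)
      i += 1
    }
    code
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zorder").setMaster("local[*]"))
    val points = sc.parallelize(Seq((3, 5), (100, 200), (101, 199)))
    // Map each point to (mortonKey, point) and sort: the resulting ordering is
    // the one-dimensional index that a range scan can exploit.
    val indexed = points.map { case (x, y) => (mortonCode(x, y), (x, y)) }.sortByKey()
    indexed.collect().foreach(println)
    sc.stop()
  }
}
```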

Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread ReeceRobinson
Does anyone have advice on the best way to deploy a Hive UDF for use with a Spark SQL Thriftserver, where the client is Tableau using the Simba ODBC Spark SQL driver? I have seen the Hive documentation that provides an example of creating the function using a Hive client, i.e.: CREATE FUNCTION

Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Igor Berman
thanks Ted :) On 18 October 2015 at 19:07, Ted Yu wrote: > Interesting reading material. > > bq. transformations that loose partitioner > > lose partitioner > > bq. Spark looses the partitioner > > loses the partitioner > > bq. Tunning number of partitions > > Should be

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Ted Yu
Umesh: $ jar tvf /home/hbase/.m2/repository/org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark.jar | grep GenericUDAFPercentile 2143 Fri Jul 31 23:51:48 PDT 2015 org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$1.class 4602 Fri Jul 31 23:51:48 PDT 2015

Re: How VectorIndexer works in Spark ML pipelines

2015-10-18 Thread Jorge Sánchez
Vishnu, VectorIndexer will add metadata regarding which features are categorical and which are continuous, depending on the threshold: if a feature has more distinct values than the *maxCategories* parameter, it will be
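
A minimal sketch of that threshold behaviour; the input DataFrame `df` and its columns "f1"/"f2" are assumed for illustration:

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Assume `df` has numeric columns "f1" and "f2"; assemble them into a vector first.
val assembled = new VectorAssembler()
  .setInputCols(Array("f1", "f2"))
  .setOutputCol("features")
  .transform(df)

// Any feature with more than maxCategories distinct values is treated as continuous;
// the rest are indexed as categorical and tagged in the column metadata.
val indexerModel = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)
  .fit(assembled)

val indexed = indexerModel.transform(assembled)
indexerModel.categoryMaps.keys.foreach(i => println(s"feature $i is categorical"))
```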

Re: dataframes and numPartitions

2015-10-18 Thread Jorge Sánchez
Alex, If not, you can try using the functions coalesce(n) or repartition(n). As per the API, coalesce will not trigger a shuffle but repartition will. Regards. 2015-10-16 0:52 GMT+01:00 Mohammed Guller : > You may find the spark.sql.shuffle.partitions property useful. The
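
A quick sketch of the difference; the DataFrame `df` and the counts are illustrative:

```scala
// coalesce(n) only merges existing partitions, so no shuffle is performed;
// it can only decrease the partition count.
val narrowed = df.coalesce(10)

// repartition(n) performs a full shuffle and can increase or decrease the count.
val reshuffled = df.repartition(200)

// For the partition count produced by joins/aggregations in Spark SQL,
// set spark.sql.shuffle.partitions (default 200) on the SQLContext.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")
```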

Indexing Support

2015-10-18 Thread Mustafa Elbehery
Hi All, I am trying to use Spark to process *spatial data*. I am looking for R-Tree indexing support in the best case, but I would be fine with any other indexing capability as well, just to improve performance. Has anyone had the same issue before, and is there any information regarding index support

Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it, but I doubt there will be R-tree indexing support in the near future, as Spark is not a database. You might have better luck looking at databases with spatial indexing support out of the box. Cheers Sent from my iPad On 2015-10-18, at 17:16, Mustafa Elbehery

pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-18 Thread fahad shah
Hi, I am trying to work with pair RDDs: group by the key and assign an id based on the key. I am using PySpark with Spark 1.3 and, for some reason, I am getting this error that I am unable to figure out - any help much appreciated. Things I tried (but to no effect): 1. make sure I am not doing any conversions on

RE: Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread Mohammed Guller
Have you tried registering the function using the Beeline client? Another alternative would be to create a Spark SQL UDF and launch the Spark SQL Thrift server programmatically. Mohammed -Original Message- From: ReeceRobinson [mailto:re...@therobinsons.gen.nz] Sent: Sunday, October
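
A hedged sketch of the second alternative; the UDF name and logic are hypothetical, and `HiveThriftServer2.startWithContext` is the programmatic entry point in Spark 1.x:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object UdfThriftServer {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("udf-thriftserver"))
    val hiveContext = new HiveContext(sc)

    // Register a Spark SQL UDF; any JDBC/ODBC client (e.g. Tableau via the
    // Simba driver) connecting to this server can then call my_upper(...).
    hiveContext.udf.register("my_upper", (s: String) => if (s == null) null else s.toUpperCase)

    // Start the Thrift server on top of this HiveContext instead of using
    // the stock start-thriftserver.sh script.
    HiveThriftServer2.startWithContext(hiveContext)
  }
}
```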

Re: repartition vs partitionby

2015-10-18 Thread shahid ashraf
Yes, I am trying to do so, but it will try to repartition the whole data. Can't we split a large (data-skewed) partition into multiple partitions? (Any ideas on this?) On Sun, Oct 18, 2015 at 1:55 AM, Adrian Tanase wrote: > If the dataset allows it you can try to write a
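
One common way to spread a skewed partition without repartitioning everything by count alone is to salt the keys; a rough sketch under assumed key/value types, with an illustrative salt factor. Aggregations then need a second pass to merge the salted partial results back per original key.

```scala
import scala.util.Random
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Assume `pairs: RDD[(String, Int)]` where a few keys are very hot.
def saltAndSpread(pairs: RDD[(String, Int)], saltBuckets: Int): RDD[((String, Int), Int)] = {
  pairs
    // Append a random salt to the key, so one hot key is spread over
    // `saltBuckets` different (key, salt) combinations...
    .map { case (k, v) => ((k, Random.nextInt(saltBuckets)), v) }
    // ...and therefore over multiple partitions after the shuffle.
    .partitionBy(new HashPartitioner(pairs.partitions.length * saltBuckets))
}
```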

Re: Should I convert json into parquet?

2015-10-18 Thread Jörn Franke
Good formats are Parquet or ORC. Both can be used with compression, such as Snappy. They are much faster than JSON. However, the table structure is up to you and depends on your use case. > On 17 Oct 2015, at 23:07, Gavin Yue wrote: > > I have json files which
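
A small sketch of the JSON-to-Parquet conversion; the paths are placeholders, and in Spark 1.x the Parquet codec is set through spark.sql.parquet.compression.codec:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("json-to-parquet"))
val sqlContext = new SQLContext(sc)

// Snappy can be set explicitly as the Parquet compression codec.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

// Read the JSON files (schema is inferred) and write them back as Parquet.
val df = sqlContext.read.json("hdfs:///data/events/*.json")
df.write.parquet("hdfs:///data/events_parquet")
```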

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Sonal Goyal
Hi Jia, RDDs are cached on the executors, not on the driver. I am assuming you are running locally and haven't changed spark.executor.memory? Sonal On Oct 19, 2015 1:58 AM, "Jia Zhan" wrote: Anyone has any clue what's going on? Why would caching with 2g of memory be much faster
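
A hedged sketch of where the relevant memory settings live; the values and input path are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// In local[*] mode the "executor" lives inside the driver JVM, so the cache size
// is bounded by the driver heap (set it with --driver-memory on spark-shell/submit;
// spark.driver.memory cannot be changed after the JVM has started).
// spark.executor.memory only matters when running against a real cluster.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("cache-sizing")
  .set("spark.executor.memory", "4g")   // illustrative; ignored in local mode

val sc = new SparkContext(conf)
val data = sc.textFile("input.txt").cache()   // cached blocks go to executor storage memory
println(data.count())
```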

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Jia Zhan
Does anyone have any clue what's going on? Why would caching with 2g of memory be much faster than with 15g of memory? Thanks very much! On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan wrote: > Hi all, > > I am running Spark locally in one node and trying to sweep the memory size > for

Re: Spark handling parallel requests

2015-10-18 Thread tarek.abouzeid91
Hi Akhlis, it's a must to push data to a socket, as I am using PHP as a web service to push data to the socket; then Spark catches the data on that socket and processes it. Is there a way to push data from PHP to Kafka directly? --  Best Regards, -- Tarek Abouzeid On Sunday, October 18, 2015
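
If the PHP side can write to Kafka with one of the PHP Kafka client libraries, Spark can consume the topic directly and skip the socket hop; a hedged sketch of the Spark side, with broker list and topic name as placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("kafka-direct")
val ssc = new StreamingContext(conf, Seconds(5))

// Direct (receiver-less) stream: Spark reads the topic partitions itself.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("web-events"))

stream.map(_._2).count().print()

ssc.start()
ssc.awaitTermination()
```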

REST api to avoid spark context creation

2015-10-18 Thread anshu shukla
I have a web-based application for analytics over data stored in HBase. A user can query data for any fixed time duration, but the response time to that query is about ~40 sec. On every request, most of the time is wasted in context creation and job submission. 1- How can I avoid context
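
Besides a job-server style solution (see the reply further down), one simple approach is to keep a single long-lived SparkContext inside the web application and reuse it for every request; a rough sketch in which the data path and the query function are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Create the context once at application start-up and share it across requests,
// so only the per-request job is submitted and the startup cost is paid once.
object SharedSpark {
  lazy val sc: SparkContext =
    new SparkContext(new SparkConf().setAppName("analytics-service"))
}

// Hypothetical request handler: each call submits a job on the shared context.
def handleQuery(start: Long, end: Long): Long = {
  val events = SharedSpark.sc.textFile("hdfs:///events")   // or an HBase-backed RDD
  events.filter { line =>
    val ts = line.split(",")(0).toLong
    ts >= start && ts <= end
  }.count()
}
```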

Spark Streaming - use the data in different jobs

2015-10-18 Thread Oded Maimon
Hi, we've built a Spark Streaming process that gets data from a pub/sub (RabbitMQ in our case). Now we want the streamed data to be used in different Spark jobs (also in real time, if possible). What options do we have for doing that? - Can the streaming process and the different Spark jobs
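
One hedged option is for the streaming job to persist each micro-batch to a shared sink (a message queue, a table, or files) that the other jobs read from; a minimal sketch using foreachRDD, with the output path as a placeholder:

```scala
import org.apache.spark.streaming.dstream.DStream

// Assume `parsed: DStream[String]` is the stream built from RabbitMQ.
def publish(parsed: DStream[String]): Unit = {
  parsed.foreachRDD { (rdd, time) =>
    if (!rdd.isEmpty()) {
      // Each batch lands in its own directory; downstream batch jobs (or another
      // streaming job watching the directory) can pick it up from there.
      rdd.saveAsTextFile(s"hdfs:///shared/stream/batch-${time.milliseconds}")
    }
  }
}
```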

No suitable Constructor found while compiling

2015-10-18 Thread VJ Anand
I am trying to extend RDD in Java, and when I call the parent constructor, it gives the error: no suitable constructor found for RDD (SparkContext, Seq, ClassTag). Here is the snippet of the code: class QueryShard extends RDD { sc (sc, (Seq)new ArrayBuffer,

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Ted Yu
The UDF is defined in GenericUDAFPercentileApprox of Hive. When spark-shell runs, it has access to the above class, which is packaged in assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.0.jar : 2143 Fri Oct 16 15:02:26 PDT 2015
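
For reference, a sketch of the call itself, assuming a HiveContext-backed DataFrame `df` with a numeric column "mycol"; the function is named callUDF in 1.5 (callUdf in earlier 1.x releases):

```scala
import org.apache.spark.sql.functions.{callUDF, col, lit}

// percentile_approx is a Hive UDAF (GenericUDAFPercentileApprox), so the Hive
// classes must be on the classpath, e.g. a Spark build with -Phive.
val q1 = df.agg(callUDF("percentile_approx", col("mycol"), lit(0.25)))
q1.show()
```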

Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Umesh Kacha
Thanks much, Ted. So when do we get to use this Spark UDF in Java code using Maven dependencies? You said JIRA 10671 was not pushed as part of 1.5.1, so it should be released in 1.6.0 as mentioned in the JIRA, right? On Sun, Oct 18, 2015 at 9:20 PM, Ted Yu wrote: > The udf

Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Ted Yu
Interesting reading material. bq. transformations that loose partitioner lose partitioner bq. Spark looses the partitioner loses the partitioner bq. Tunning number of partitions Should be tuning. bq. or increase shuffle fraction bq. ShuffleMemoryManager: Thread 61 ... Hopefully SPARK-1

Re: REST api to avoid spark context creation

2015-10-18 Thread Raghavendra Pandey
You may like to look at spark job server. https://github.com/spark-jobserver/spark-jobserver Raghavendra

our spark gotchas report while creating batch pipeline

2015-10-18 Thread igor.berman
Maybe somebody will find it useful: goo.gl/0yfvBd -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/our-spark-gotchas-report-while-creating-batch-pipeline-tp25112.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: No suitable Constructor found while compiling

2015-10-18 Thread Ted Yu
I see a two-argument ctor, e.g. /** Construct an RDD with just a one-to-one dependency on one parent */ def this(@transient oneParent: RDD[_]) = this(oneParent.context , List(new OneToOneDependency(oneParent))) Looks like Tuple in your code is T in the following: abstract class RDD[T:
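
A minimal Scala sketch of a subclass that satisfies that constructor; the class name mirrors the one in the question and the body is hypothetical:

```scala
import scala.reflect.ClassTag
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// The parent constructor needs a SparkContext, a Seq[Dependency[_]] (Nil here,
// since this RDD has no parent), and an implicit ClassTag for the element type.
class QueryShardRDD(sc: SparkContext) extends RDD[String](sc, Nil) {

  // A single dummy partition, just to make the example complete.
  override protected def getPartitions: Array[Partition] =
    Array(new Partition { override def index: Int = 0 })

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator("example")
}
```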

Pass spark partition explicitly ?

2015-10-18 Thread kali.tumm...@gmail.com
Hi All, can I pass the number of partitions to all the RDDs explicitly while submitting the Spark job, or do I need to specify it in my Spark code itself? Thanks Sri -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pass-spark-partition-explicitly-tp25113.html

Re: Pass spark partition explicitly ?

2015-10-18 Thread Richard Eggert
If you want to override the default partitioning behavior, you have to do so in your code where you create each RDD. Different RDDs usually have different numbers of partitions (except when one RDD is directly derived from another without shuffling) because they usually have different sizes, so
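
A few hedged examples of where the partition count is stated in code (paths and counts are illustrative); spark.default.parallelism passed at submit time only provides a default for operations that don't specify one:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partitions"))

// Minimum number of partitions when reading a file.
val lines = sc.textFile("hdfs:///data/input", 128)

// Explicit partition count when parallelizing a collection.
val nums = sc.parallelize(1 to 1000000, 32)

// Explicit partition count for a shuffle-producing transformation.
val counts = lines.map(w => (w, 1)).reduceByKey(_ + _, 64)

// Or change it after the fact.
val repartitioned = counts.repartition(16)
```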

Re: Pass spark partition explicitly ?

2015-10-18 Thread sri hari kali charan Tummala
Hi Richard, thanks. So my take from your discussion is that if we want to pass partition values explicitly, it has to be written inside the code. Thanks Sri On Sun, Oct 18, 2015 at 7:05 PM, Richard Eggert wrote: > If you want to override the default partitioning behavior, you