Re: Indexing Support

2015-10-18 Thread Russ Weeks
Distributed R-Trees are not very common. Most "big data" spatial solutions
collapse multi-dimensional data into a distributed one-dimensional index
using a space-filling curve. Many implementations exist outside of Spark,
e.g. for HBase or Accumulo. It's simple enough to write a map function that
takes a longitude+latitude pair and converts it to a position on a Z curve,
so you can work with a PairRDD keyed by that curve position. It's more
complicated to convert a geospatial query expressed as a bounding box into
a set of disjoint curve intervals, but there are good examples out there.
The excellent Accumulo Recipes project has an implementation of such an
algorithm; it would be pretty easy to port it to work with a PairRDD as
described above.
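
A minimal sketch of the kind of map function described above, assuming 16-bit
quantization per coordinate and a simple (lon, lat, payload) record shape (both are
illustrative assumptions, not from the original post):

    import org.apache.spark.rdd.RDD

    object ZCurve {
      // Quantize a coordinate from [min, max] onto a 16-bit grid (illustrative resolution).
      def quantize(v: Double, min: Double, max: Double): Long =
        (((v - min) / (max - min)) * 0xFFFF).toLong & 0xFFFFL

      // Interleave the bits of x and y into a 32-bit Z-order (Morton) value.
      def interleave(x: Long, y: Long): Long = {
        var z = 0L
        var i = 0
        while (i < 16) {
          z |= ((x >> i) & 1L) << (2 * i)
          z |= ((y >> i) & 1L) << (2 * i + 1)
          i += 1
        }
        z
      }

      // Key each record by its Z-curve position so spatially close points get close keys.
      def byZValue(records: RDD[(Double, Double, String)]): RDD[(Long, (Double, Double, String))] =
        records.map { case rec @ (lon, lat, _) =>
          (interleave(quantize(lon, -180, 180), quantize(lat, -90, 90)), rec)
        }
    }

Range-partitioning or sorting on that key then approximates spatial locality, and a
bounding-box query decomposes into a set of Z-value intervals, as in the Accumulo
Recipes algorithm mentioned above.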

On Sun, Oct 18, 2015 at 3:26 PM Jerry Lam  wrote:

> I'm interested in it but I doubt there is R-tree indexing support in the
> near future, as Spark is not a database. You might have better luck
> looking at databases with spatial indexing support out of the box.
>
> Cheers
>
> Sent from my iPad
>
> On 2015-10-18, at 17:16, Mustafa Elbehery 
> wrote:
>
> Hi All,
>
> I am trying to use Spark to process *Spatial Data*. I am looking for
> R-Tree Indexing support in best case, but I would be fine with any other
> indexing capability as well, just to improve performance.
>
> Anyone had the same issue before, and is there any information regarding
> Index support in future releases ?!!
>
> Regards.
>
> --
> Mustafa Elbehery
> EIT ICT Labs Master School 
> +49(0)15750363097
> skype: mustafaelbehery87
>
>


Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread ReeceRobinson
Does anyone have advice on the best way to deploy a Hive UDF for use
with a Spark SQL Thriftserver where the client is Tableau using the Simba ODBC
Spark SQL driver?

I have seen the Hive documentation that provides an example of creating the
function using a Hive client, i.e.: CREATE FUNCTION myfunc AS 'myclass' USING
JAR 'hdfs:///path/to/jar';

However, using Tableau I can't run this CREATE FUNCTION statement to register
my UDF. Ideally there would be a configuration setting that loads my UDF jar
and registers it at start-up of the Thriftserver.

Can anyone tell me what the best option is, if this is possible?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Igor Berman
thanks Ted :)


On 18 October 2015 at 19:07, Ted Yu  wrote:

> Interesting reading material.
>
> bq. transformations that loose partitioner
>
> lose partitioner
>
> bq. Spark looses the partitioner
>
> loses the partitioner
>
> bq. Tunning number of partitions
>
> Should be tuning.
>
> bq. or increase shuffle fraction
> bq. ShuffleMemoryManager: Thread 61 ...
>
> Hopefully SPARK-1 would alleviate the above situation.
>
> Cheers
>
> On Sun, Oct 18, 2015 at 8:51 AM, igor.berman 
> wrote:
>
>> Maybe somebody will find it useful:
>> goo.gl/0yfvBd
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/our-spark-gotchas-report-while-creating-batch-pipeline-tp25112.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Ted Yu
Umesh:

$ jar tvf
/home/hbase/.m2/repository/org/spark-project/hive/hive-exec/1.2.1.spark/hive-exec-1.2.1.spark.jar
| grep GenericUDAFPercentile
  2143 Fri Jul 31 23:51:48 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$1.class
  4602 Fri Jul 31 23:51:48 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFMultiplePercentileApproxEvaluator.class

As long as the following dependency is in your pom.xml:
[INFO] +- org.spark-project.hive:hive-exec:jar:1.2.1.spark:compile

You should be able to invoke percentile_approx

Cheers

On Sun, Oct 18, 2015 at 8:58 AM, Umesh Kacha  wrote:

> Thanks much, Ted. So when do we get to use this Spark UDF in Java code using
> Maven dependencies? You said JIRA 10671 is not pushed as part of
> 1.5.1, so it should be released in 1.6.0 as mentioned in the JIRA, right?
>
> On Sun, Oct 18, 2015 at 9:20 PM, Ted Yu  wrote:
>
>> The udf is defined in GenericUDAFPercentileApprox of hive.
>>
>> When spark-shell runs, it has access to the above class which is packaged
>> in assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.0.jar
>> :
>>
>>   2143 Fri Oct 16 15:02:26 PDT 2015
>> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$1.class
>>   4602 Fri Oct 16 15:02:26 PDT 2015
>> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFMultiplePercentileApproxEvaluator.class
>>   1697 Fri Oct 16 15:02:26 PDT 2015
>> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator$PercentileAggBuf.class
>>   6570 Fri Oct 16 15:02:26 PDT 2015
>> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.class
>>   4334 Fri Oct 16 15:02:26 PDT 2015
>> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFSinglePercentileApproxEvaluator.class
>>   6293 Fri Oct 16 15:02:26 PDT 2015
>> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.class
>>
>> That was the cause for different behavior.
>>
>> FYI
>>
>> On Sun, Oct 18, 2015 at 12:10 AM, unk1102  wrote:
>>
>>> Hi, starting a new thread following the old thread. It looks like the code for compiling
>>> callUdf("percentile_approx",col("mycol"),lit(0.25)) is not merged into the Spark
>>> 1.5.1 source, but I don't understand why this function call works in the Spark
>>> 1.5.1 spark-shell/bin. Please guide.
>>>
>>> -- Forwarded message --
>>> From: "Ted Yu" 
>>> Date: Oct 14, 2015 3:26 AM
>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>> To: "Umesh Kacha" 
>>> Cc: "Michael Armbrust" ,
>>> "saif.a.ell...@wellsfargo.com" ,
>>> "user" 
>>>
>>> I modified DataFrameSuite, in master branch, to call percentile_approx
>>> instead of simpleUDF :
>>>
>>> - deprecated callUdf in SQLContext
>>> - callUDF in SQLContext *** FAILED ***
>>>   org.apache.spark.sql.AnalysisException: undefined function
>>> percentile_approx;
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>>>   at scala.Option.getOrElse(Option.scala:120)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
>>>   at
>>>
>>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>>>
>>> SPARK-10671 is included.
>>> For 1.5.1, I guess the absence of SPARK-10671 means that SparkSQL treats
>>> percentile_approx as normal UDF.
>>>
>>> Experts can correct me, if there is any misunderstanding.
>>>
>>> Cheers
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/callUdf-percentile-approx-col-mycol-lit-0-25-does-not-compile-spark-1-5-1-source-but-it-does-work-inn-tp25111.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> 

Re: How VectorIndexer works in Spark ML pipelines

2015-10-18 Thread Jorge Sánchez
Vishnu,

VectorIndexer will add metadata regarding which features are categorical and which are
continuous, depending on the threshold: if a feature has more distinct unique
values than the *maxCategories* parameter, it will be treated as
continuous. That helps the learning algorithms, as categorical and continuous
features are treated differently.
From the data I can see you have more than one Vector in the features
column. Try using some Vectors with only two different values.
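
For example, a minimal sketch (spark.ml in Spark 1.5, run from spark-shell where
sqlContext is available; the toy data is an illustrative assumption) in which feature 0
takes only two distinct values and is indexed as categorical, while feature 1 takes five
distinct values and stays continuous:

    import org.apache.spark.ml.feature.VectorIndexer
    import org.apache.spark.mllib.linalg.Vectors

    // Feature 0: two distinct values (<= maxCategories, so categorical).
    // Feature 1: five distinct values (> maxCategories, so continuous).
    val data = sqlContext.createDataFrame(Seq(
      (0, Vectors.dense(0.0, 12.5)),
      (1, Vectors.dense(1.0, 33.1)),
      (2, Vectors.dense(0.0, 47.8)),
      (3, Vectors.dense(1.0, 8.4)),
      (4, Vectors.dense(0.0, 21.9))
    )).toDF("id", "features")

    val model = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(4)
      .fit(data)

    // Only categorical features appear here; feature 0 maps {0.0, 1.0} to category indices.
    println(model.categoryMaps)
    model.transform(data).show()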

Regards.

2015-10-15 10:14 GMT+01:00 VISHNU SUBRAMANIAN :

> HI All,
>
> I am trying to use the VectorIndexer (FeatureExtraction) technique
> available from the Spark ML Pipelines.
>
> I ran the example in the documentation .
>
> val featureIndexer = new VectorIndexer()
>   .setInputCol("features")
>   .setOutputCol("indexedFeatures")
>   .setMaxCategories(4)
>   .fit(data)
>
>
> And then I wanted to see what output it generates.
>
> After performing transform on the data set , the output looks like below.
>
> scala> predictions.select("indexedFeatures").take(1).foreach(println)
>
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
>
> scala> predictions.select("features").take(1).foreach(println)
>
>
> [(692,[124,125,126,127,151,152,153,154,155,179,180,181,182,183,208,209,210,211,235,236,237,238,239,263,264,265,266,267,268,292,293,294,295,296,321,322,323,324,349,350,351,352,377,378,379,380,405,406,407,408,433,434,435,436,461,462,463,464,489,490,491,492,493,517,518,519,520,521,545,546,547,548,549,574,575,576,577,578,602,603,604,605,606,630,631,632,633,634,658,659,660,661,662],[145.0,255.0,211.0,31.0,32.0,237.0,253.0,252.0,71.0,11.0,175.0,253.0,252.0,71.0,144.0,253.0,252.0,71.0,16.0,191.0,253.0,252.0,71.0,26.0,221.0,253.0,252.0,124.0,31.0,125.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,253.0,252.0,252.0,108.0,255.0,253.0,253.0,170.0,253.0,252.0,252.0,252.0,42.0,149.0,252.0,252.0,252.0,144.0,109.0,252.0,252.0,252.0,144.0,218.0,253.0,253.0,255.0,35.0,175.0,252.0,252.0,253.0,35.0,73.0,252.0,252.0,253.0,35.0,31.0,211.0,252.0,253.0,35.0])]
>
> I can't understand what is happening. I tried with simple data sets also,
> but got a similar result.
>
> Please help.
>
> Thanks,
>
> Vishnu
>
>
>
>
>
>
>
>


Re: dataframes and numPartitions

2015-10-18 Thread Jorge Sánchez
Alex,

If not, you can try using the functions coalesce(n) or repartition(n).

As per the API, coalesce will not make a shuffle but repartition will.
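
A minimal sketch of both, together with the SQL shuffle setting Mohammed mentions
below (df stands for your DataFrame and the partition counts are illustrative):

    // Reduce the number of partitions without a full shuffle, or force a shuffle to a target count.
    val narrowed   = df.coalesce(10)
    val reshuffled = df.repartition(200)

    // Controls how many partitions DataFrame shuffles (groupBy, join, ...) produce; default is 200.
    sqlContext.setConf("spark.sql.shuffle.partitions", "400")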

Regards.

2015-10-16 0:52 GMT+01:00 Mohammed Guller :

> You may find the spark.sql.shuffle.partitions property useful. The default
> value is 200.
>
>
>
> Mohammed
>
>
>
> *From:* Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
> *Sent:* Wednesday, October 14, 2015 8:14 PM
> *To:* user
> *Subject:* dataframes and numPartitions
>
>
>
> A lot of RDD methods take a numPartitions parameter that lets you specify
> the number of partitions in the result. For example, groupByKey.
>
>
>
> The DataFrame counterparts don't have a numPartitions parameter, e.g.
> groupBy only takes a bunch of Columns as params.
>
>
>
> I understand that the DataFrame API is supposed to be smarter and go
> through a LogicalPlan, and perhaps determine the number of optimal
> partitions for you, but sometimes you want to specify the number of
> partitions yourself. One such use case is when you are preparing to do a
> "merge" join with another dataset that is similarly partitioned with the
> same number of partitions.
>


Indexing Support

2015-10-18 Thread Mustafa Elbehery
Hi All,

I am trying to use Spark to process *Spatial Data*. I am looking for R-Tree
Indexing support in best case, but I would be fine with any other indexing
capability as well, just to improve performance.

Anyone had the same issue before, and is there any information regarding
Index support in future releases ?!!

Regards.

-- 
Mustafa Elbehery
EIT ICT Labs Master School 
+49(0)15750363097
skype: mustafaelbehery87


Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it but I doubt there is R-tree indexing support in the near
future, as Spark is not a database. You might have better luck looking at
databases with spatial indexing support out of the box.

Cheers

Sent from my iPad

On 2015-10-18, at 17:16, Mustafa Elbehery  wrote:

> Hi All, 
> 
> I am trying to use spark to process Spatial Data. I am looking for R-Tree 
> Indexing support in best case, but I would be fine with any other indexing 
> capability as well, just to improve performance. 
> 
> Anyone had the same issue before, and is there any information regarding 
> Index support in future releases ?!!
> 
> Regards.
> 
> -- 
> Mustafa Elbehery
> EIT ICT Labs Master School
> +49(0)15750363097
> skype: mustafaelbehery87
> 


pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-18 Thread fahad shah
Hi,

I am trying to work with pair RDDs: group by the key and assign an ID based on the key.
I am using PySpark with Spark 1.3. For some reason, I am getting this
error that I am unable to figure out - any help much appreciated.

Things I tried (but to no effect):

1. make sure I am not doing any conversions on the strings
2. make sure that the fields used in the key are all there and not an
empty string (or else I toss the row out)

My code is along the following lines (split uses StringIO to parse
CSV, isHeader removes the header row, and parse_train puts the 54
fields into a named tuple after whitespace/quote removal):

# The string-argument error is thrown on BB.take(1), where the groupByKey is evaluated.

A = (sc.textFile("train.csv")
       .filter(lambda x: not isHeader(x))
       .map(split)
       .map(parse_train)
       .filter(lambda x: x is not None))

A.count()

B = A.map(lambda k: ((k.srch_destination_id, k.srch_length_of_stay,
                      k.srch_booking_window, k.srch_adults_count,
                      k.srch_children_count, k.srch_room_count),
                     (k[0:54])))
BB = B.groupByKey()
BB.take(1)


best fahad

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Spark SQL Thriftserver and Hive UDF in Production

2015-10-18 Thread Mohammed Guller
Have you tried registering the function using the Beeline client?

Another alternative would be to create a Spark SQL UDF and launch the Spark SQL 
Thrift server programmatically.
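
A rough sketch of that second approach, assuming Spark 1.5.x (the UDF body and names
are illustrative placeholders): register the function on a HiveContext and start the
Thrift server against that same context, so Tableau sessions can call it without
running CREATE FUNCTION:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    object UdfThriftServer {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("thriftserver-with-udf"))
        val hiveContext = new HiveContext(sc)

        // Register the Spark SQL UDF up front; replace the body with your real logic.
        hiveContext.udf.register("myfunc", (s: String) => s.toUpperCase)

        // Expose this context over JDBC/ODBC so clients such as Tableau see the function.
        HiveThriftServer2.startWithContext(hiveContext)
      }
    }

This assumes the hive-thriftserver classes are on the application classpath.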

Mohammed

-Original Message-
From: ReeceRobinson [mailto:re...@therobinsons.gen.nz] 
Sent: Sunday, October 18, 2015 8:05 PM
To: user@spark.apache.org
Subject: Spark SQL Thriftserver and Hive UDF in Production

Does anyone have advice on the best way to deploy a Hive UDF for use with
a Spark SQL Thriftserver where the client is Tableau using the Simba ODBC Spark SQL
driver?

I have seen the Hive documentation that provides an example of creating the
function using a Hive client, i.e.: CREATE FUNCTION myfunc AS 'myclass' USING JAR
'hdfs:///path/to/jar';

However, using Tableau I can't run this CREATE FUNCTION statement to register my
UDF. Ideally there would be a configuration setting that loads my UDF jar and
registers it at start-up of the Thriftserver.

Can anyone tell me what the best option is, if this is possible?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Thriftserver-and-Hive-UDF-in-Production-tp25114.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: repartition vs partitionby

2015-10-18 Thread shahid ashraf
Yes, I am trying to do so, but it will try to repartition the whole data. Can't
we split a large (data-skewed) partition into multiple partitions?
(Any ideas on this?)
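
One common way to split a skewed partition without repartitioning everything is to
salt the keys so each hot key's records spread over several reduce partitions; a rough
sketch, assuming an associative aggregation (pairRdd below is an illustrative stand-in
for the skewed data, and the salt factor is arbitrary):

    import scala.util.Random

    val pairRdd = sc.parallelize(Seq(("hot", 1L), ("hot", 1L), ("cold", 1L)))  // stand-in data
    val saltBuckets = 10

    // Spread each key over saltBuckets sub-keys, aggregate per sub-key, then merge the partials.
    val salted  = pairRdd.map { case (k, v) => ((k, Random.nextInt(saltBuckets)), v) }
    val partial = salted.reduceByKey(_ + _)
    val merged  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)

Alternatively, as Adrian suggests below, a custom Partitioner that routes known-hot
keys to their own partitions achieves a similar effect without changing the keys.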

On Sun, Oct 18, 2015 at 1:55 AM, Adrian Tanase  wrote:

> If the dataset allows it you can try to write a custom partitioner to help
> spark distribute the data more uniformly.
>
> Sent from my iPhone
>
> On 17 Oct 2015, at 16:14, shahid ashraf  wrote:
>
> Yes, I know about that; it's for the case of reducing partitions. The point here is
> that the data is skewed to a few partitions.
>
>
> On Sat, Oct 17, 2015 at 6:27 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com> wrote:
>
>> You can use coalesce function, if you want to reduce the number of
>> partitions. This one minimizes the data shuffle.
>>
>> -Raghav
>>
>> On Sat, Oct 17, 2015 at 1:02 PM, shahid qadri 
>> wrote:
>>
>>> Hi folks
>>>
>>> I need to repartition a large set of data, around 300G, as I see some
>>> portions have a lot of data (data skew).
>>>
>>> I have pairRDDs [({},{}),({},{}),({},{})]
>>>
>>> What is the best way to solve the problem?
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> with Regards
> Shahid Ashraf
>
>


-- 
with Regards
Shahid Ashraf


Re: Should I convert json into parquet?

2015-10-18 Thread Jörn Franke


Good formats are Parquet or ORC. Both can be used with compression, such as
Snappy. They are much faster than JSON; however, the table structure is up to
you and depends on your use case.
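
For example, a minimal sketch of converting the JSON event files to Snappy-compressed
Parquet (the paths and the eventDate partition column are illustrative assumptions):

    sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

    val events = sqlContext.read.json("/data/events/json/")

    // Partitioning the output by day keeps daily appends cheap and later scans selective.
    events.write
      .mode("append")
      .partitionBy("eventDate")
      .parquet("/data/events/parquet/")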

> On 17 Oct 2015, at 23:07, Gavin Yue  wrote:
> 
> I have json files which contains timestamped events.  Each event associate 
> with a user id. 
> 
> Now I want to group by user id. So converts from
> 
> Event1 -> UserIDA;
> Event2 -> UserIDA;
> Event3 -> UserIDB;
> 
> To intermediate storage. 
> UserIDA -> (Event1, Event2...) 
> UserIDB-> (Event3...) 
> 
> Then I will label positives and featurize the Events Vector in many different 
> ways, fit each of them into the Logistic Regression. 
> 
> I want to save intermediate storage permanently since it will be used many 
> times.  And there will new events coming every day. So I need to update this 
> intermediate storage every day. 
> 
> Right now I store the intermediate data using JSON files.  Should I use 
> Parquet instead?  Or are there better solutions for this use case?
> 
> Thanks a lot !
> 
> 
> 
> 
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: In-memory computing and cache() in Spark

2015-10-18 Thread Sonal Goyal
Hi Jia,

RDDs are cached on the executor, not on the driver. I am assuming you are
running locally and haven't changed spark.executor.memory?

Sonal
On Oct 19, 2015 1:58 AM, "Jia Zhan"  wrote:

Does anyone have any clue what's going on? Why would caching with 2g of memory be much
faster than with 15g of memory?

Thanks very much!

On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan  wrote:

> Hi all,
>
> I am running Spark locally in one node and trying to sweep the memory size
> for performance tuning. The machine has 8 CPUs and 16G main memory, the
> dataset in my local disk is about 10GB. I have several quick questions and
> appreciate any comments.
>
> 1. Spark performs in-memory computing, but without using RDD.cache(), will
> anything be cached in memory at all? My guess is that, without RDD.cache(),
> only a small amount of data will be stored in OS buffer cache, and every
> iteration of computation will still need to fetch most data from disk every
> time, is that right?
>
> 2. To evaluate how caching helps with iterative computation, I wrote a
> simple program as shown below, which basically consists of one saveAsText()
> and three reduce() actions/stages. I specify "spark.driver.memory" to
> "15g", others by default. Then I run three experiments.
>
>   val conf = new SparkConf().setAppName("wordCount")
>   val sc = new SparkContext(conf)
>   val input = sc.textFile("/InputFiles")
>   val words = input.flatMap(line => line.split(" ")).map(word => (word, 1))
>     .reduceByKey(_+_).saveAsTextFile("/OutputFiles")
>   val ITERATIONS = 3
>   for (i <- 1 to ITERATIONS) {
>     val totallength = input.filter(line => line.contains("the"))
>       .map(s => s.length).reduce((a,b) => a+b)
>   }
>
> (I) The first run: no caching at all. The application finishes in ~12
> minutes (2.6min+3.3min+3.2min+3.3min)
>
> (II) The second run, I modified the code so that the input will be cached:
>  val input = sc.textFile("/InputFiles").cache()
>  The application finishes in ~11 mins!! (5.4min+1.9min+1.9min+2.0min)!
>  The storage page in Web UI shows 48% of the dataset  is cached, which
> makes sense due to large java object overhead, and
> spark.storage.memoryFraction is 0.6 by default.
>
> (III) However, the third run, same program as the second one, but I
> changed "spark.driver.memory" to be "2g".
>The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
> And UI shows 6% of the data is cached.
>
> *From the results we can see the reduce stages finish in seconds, how
> could that happen with only 6% cached? Can anyone explain?*
>
> I am new to Spark and would appreciate any help on this. Thanks!
>
> Jia
>
>
>
>


-- 
Jia Zhan


Re: In-memory computing and cache() in Spark

2015-10-18 Thread Jia Zhan
Does anyone have any clue what's going on? Why would caching with 2g of memory be much
faster than with 15g of memory?

Thanks very much!

On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan  wrote:

> Hi all,
>
> I am running Spark locally in one node and trying to sweep the memory size
> for performance tuning. The machine has 8 CPUs and 16G main memory, the
> dataset in my local disk is about 10GB. I have several quick questions and
> appreciate any comments.
>
> 1. Spark performs in-memory computing, but without using RDD.cache(), will
> anything be cached in memory at all? My guess is that, without RDD.cache(),
> only a small amount of data will be stored in OS buffer cache, and every
> iteration of computation will still need to fetch most data from disk every
> time, is that right?
>
> 2. To evaluate how caching helps with iterative computation, I wrote a
> simple program as shown below, which basically consists of one saveAsText()
> and three reduce() actions/stages. I specify "spark.driver.memory" to
> "15g", others by default. Then I run three experiments.
>
>   val conf = new SparkConf().setAppName("wordCount")
>   val sc = new SparkContext(conf)
>   val input = sc.textFile("/InputFiles")
>   val words = input.flatMap(line => line.split(" ")).map(word => (word, 1))
>     .reduceByKey(_+_).saveAsTextFile("/OutputFiles")
>   val ITERATIONS = 3
>   for (i <- 1 to ITERATIONS) {
>     val totallength = input.filter(line => line.contains("the"))
>       .map(s => s.length).reduce((a,b) => a+b)
>   }
>
> (I) The first run: no caching at all. The application finishes in ~12
> minutes (2.6min+3.3min+3.2min+3.3min)
>
> (II) The second run, I modified the code so that the input will be cached:
>  val input = sc.textFile("/InputFiles").cache()
>  The application finishes in ~11 mins!! (5.4min+1.9min+1.9min+2.0min)!
>  The storage page in Web UI shows 48% of the dataset  is cached, which
> makes sense due to large java object overhead, and
> spark.storage.memoryFraction is 0.6 by default.
>
> (III) However, the third run, same program as the second one, but I
> changed "spark.driver.memory" to be "2g".
>The application finishes in just 3.6 minutes (3.0min + 9s + 9s + 9s)!!
> And UI shows 6% of the data is cached.
>
> *From the results we can see the reduce stages finish in seconds, how
> could that happen with only 6% cached? Can anyone explain?*
>
> I am new to Spark and would appreciate any help on this. Thanks!
>
> Jia
>
>
>
>


-- 
Jia Zhan


Re: Spark handling parallel requests

2015-10-18 Thread tarek.abouzeid91
Hi Akhil,
It's a must to push data to a socket, as I am using PHP as a web service to push
data to the socket; Spark then catches the data on that socket and processes it. Is
there a way to push data from PHP to Kafka directly?
--
Best Regards,
Tarek Abouzeid


 On Sunday, October 18, 2015 10:26 AM, "tarek.abouzei...@yahoo.com" 
 wrote:
   

Hi Xiao,
1- The requests are not similar at all, but they use Solr and do commits sometimes.
2- No caching is required.
3- The throughput must be very high; the requests are tiny but the system may
receive 100 requests/sec. Does Kafka support listening to a socket?
--
Best Regards,
Tarek Abouzeid


 On Monday, October 12, 2015 10:50 AM, Xiao Li  wrote:
   

 Hi, Tarek, 
It is hard to answer your question. Are these requests similar? Caching your 
results or intermediate results in your applications? Or does that mean your 
throughput requirement is very high? Throttling the number of concurrent 
requests? ...
As Akhil said, Kafka might help in your case. Otherwise, you need to read the 
designs or even source codes of Kafka and Spark Streaming. 
 Best wishes, 
Xiao Li

2015-10-11 23:19 GMT-07:00 Akhil Das :

Instead of pushing your requests to the socket, why don't you push them to
Kafka or any other message queue and use Spark Streaming to process them?
Thanks
Best Regards
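
On the Spark side of Akhil's suggestion, a rough sketch of consuming such a Kafka
topic with the direct stream API (Spark 1.3+ with the spark-streaming-kafka artifact;
the broker address and topic name are illustrative):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object RequestStream {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("request-stream"), Seconds(5))

        val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
        val requests = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("requests"))

        // One long-running streaming job handles every incoming request batch.
        requests.map(_._2).foreachRDD(rdd => rdd.foreach(println))

        ssc.start()
        ssc.awaitTermination()
      }
    }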
On Mon, Oct 5, 2015 at 6:46 PM,  wrote:

Hi,
I am using Scala, doing a socket program to catch multiple requests at the same
time and then call a function which uses Spark to handle each process. I have
a multi-threaded server to handle the multiple requests and pass each to Spark,
but there's a bottleneck, as Spark doesn't initialize a sub-task for the
new request. Is it even possible to do parallel processing using a single Spark
job?
--
Best Regards,
Tarek Abouzeid





   

  

REST api to avoid spark context creation

2015-10-18 Thread anshu shukla
I have a web-based application for analytics over the data stored in
HBase. A user can query data for any fixed time duration, but the
response time for such a query is about ~40 sec. On every request, most of the time
is spent in context creation and job submission.

1- How can I avoid context creation for every job?
2- Can I have something like a pool to serve requests?
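
Besides the job-server approach suggested elsewhere in this digest, a rough sketch of
the usual pattern: create one long-lived SparkContext when the web application starts
and reuse it for every request (the handler shape and names are illustrative; Spark's
scheduler is thread-safe, so concurrent requests can submit jobs to the same context):

    import org.apache.spark.{SparkConf, SparkContext}

    // Created once at application start-up, then shared by all request-handling threads.
    object SharedSpark {
      lazy val sc: SparkContext =
        new SparkContext(new SparkConf().setAppName("analytics-service"))
    }

    // Illustrative handler: only the job runs per request, not context creation.
    def handleQuery(inputPath: String): Long =
      SharedSpark.sc.textFile(inputPath).count()

With spark.scheduler.mode set to FAIR, several such requests can share the cluster
instead of queueing behind each other.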

-- 
Thanks & Regards,
Anshu Shukla


Spark Streaming - use the data in different jobs

2015-10-18 Thread Oded Maimon
Hi,
we've built a Spark Streaming process that gets data from a pub/sub
(RabbitMQ in our case).

Now we want the streamed data to be used in different Spark jobs (also in
real time if possible).

What options do we have for doing that?


   - can the streaming process and different spark jobs share/access the
   same RDD's?
   - can the streaming process create a sparkSQL table and other jobs
   read/use it?
   - can a spark streaming process trigger other spark jobs and send the
   the data (in memory)?
   - can a spark streaming process cache the data in memory and other
   scheduled jobs access same rdd's?
   - should we keep the data to hbase and read it from other jobs?
   - other ways?


I believe that the answer will be to use an external DB/storage... hoping to
have a different solution :)

Thanks.


Regards,
Oded Maimon
Scene53.

-- 


*This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are 
addressed. Please note that any disclosure, copying or distribution of the 
content of this information is strictly forbidden. If you have received 
this email message in error, please destroy it immediately and notify its 
sender.*


No suitable Constructor found while compiling

2015-10-18 Thread VJ Anand
I am trying to extend RDD in java, and when I call the parent constructor,
it gives the error: no suitable constructor found for RDD (SparkContext,
Seq, ClassTag).

 Here is the snippet of the code:

class QueryShard extends RDD {



sc (sc, (Seq)new ArrayBuffer,
ClassTag$.MODULE$.apply(Tuple.class);

}

Compiler error: no suitable constructor found for RDD (SparkContext, Seq,
ClassTag)

Any thoughts? pointers..

Thanks
VJ


Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Ted Yu
The udf is defined in GenericUDAFPercentileApprox of hive.

When spark-shell runs, it has access to the above class which is packaged
in assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.0.jar
:

  2143 Fri Oct 16 15:02:26 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$1.class
  4602 Fri Oct 16 15:02:26 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFMultiplePercentileApproxEvaluator.class
  1697 Fri Oct 16 15:02:26 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator$PercentileAggBuf.class
  6570 Fri Oct 16 15:02:26 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.class
  4334 Fri Oct 16 15:02:26 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFSinglePercentileApproxEvaluator.class
  6293 Fri Oct 16 15:02:26 PDT 2015
org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.class

That was the cause for different behavior.

FYI

On Sun, Oct 18, 2015 at 12:10 AM, unk1102  wrote:

> Hi, starting a new thread following the old thread. It looks like the code for compiling
> callUdf("percentile_approx",col("mycol"),lit(0.25)) is not merged into the Spark
> 1.5.1 source, but I don't understand why this function call works in the Spark
> 1.5.1 spark-shell/bin. Please guide.
>
> -- Forwarded message --
> From: "Ted Yu" 
> Date: Oct 14, 2015 3:26 AM
> Subject: Re: How to calculate percentile of a column of DataFrame?
> To: "Umesh Kacha" 
> Cc: "Michael Armbrust" ,
> "saif.a.ell...@wellsfargo.com" ,
> "user" 
>
> I modified DataFrameSuite, in master branch, to call percentile_approx
> instead of simpleUDF :
>
> - deprecated callUdf in SQLContext
> - callUDF in SQLContext *** FAILED ***
>   org.apache.spark.sql.AnalysisException: undefined function
> percentile_approx;
>   at
>
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>   at
>
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>   at scala.Option.getOrElse(Option.scala:120)
>   at
>
> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
>   at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>   at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>   at
>
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>   at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
>   at
>
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
>   at
>
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>
> SPARK-10671 is included.
> For 1.5.1, I guess the absence of SPARK-10671 means that SparkSQL treats
> percentile_approx as normal UDF.
>
> Experts can correct me, if there is any misunderstanding.
>
> Cheers
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/callUdf-percentile-approx-col-mycol-lit-0-25-does-not-compile-spark-1-5-1-source-but-it-does-work-inn-tp25111.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: callUdf("percentile_approx",col("mycol"),lit(0.25)) does not compile spark 1.5.1 source but it does work in spark 1.5.1 bin

2015-10-18 Thread Umesh Kacha
Thanks much, Ted. So when do we get to use this Spark UDF in Java code using
Maven dependencies? You said JIRA 10671 is not pushed as part of
1.5.1, so it should be released in 1.6.0 as mentioned in the JIRA, right?

On Sun, Oct 18, 2015 at 9:20 PM, Ted Yu  wrote:

> The udf is defined in GenericUDAFPercentileApprox of hive.
>
> When spark-shell runs, it has access to the above class which is packaged
> in assembly/target/scala-2.10/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.0.jar
> :
>
>   2143 Fri Oct 16 15:02:26 PDT 2015
> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$1.class
>   4602 Fri Oct 16 15:02:26 PDT 2015
> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFMultiplePercentileApproxEvaluator.class
>   1697 Fri Oct 16 15:02:26 PDT 2015
> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator$PercentileAggBuf.class
>   6570 Fri Oct 16 15:02:26 PDT 2015
> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.class
>   4334 Fri Oct 16 15:02:26 PDT 2015
> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox$GenericUDAFSinglePercentileApproxEvaluator.class
>   6293 Fri Oct 16 15:02:26 PDT 2015
> org/apache/hadoop/hive/ql/udf/generic/GenericUDAFPercentileApprox.class
>
> That was the cause for different behavior.
>
> FYI
>
> On Sun, Oct 18, 2015 at 12:10 AM, unk1102  wrote:
>
>> Hi, starting a new thread following the old thread. It looks like the code for compiling
>> callUdf("percentile_approx",col("mycol"),lit(0.25)) is not merged into the Spark
>> 1.5.1 source, but I don't understand why this function call works in the Spark
>> 1.5.1 spark-shell/bin. Please guide.
>>
>> -- Forwarded message --
>> From: "Ted Yu" 
>> Date: Oct 14, 2015 3:26 AM
>> Subject: Re: How to calculate percentile of a column of DataFrame?
>> To: "Umesh Kacha" 
>> Cc: "Michael Armbrust" ,
>> "saif.a.ell...@wellsfargo.com" ,
>> "user" 
>>
>> I modified DataFrameSuite, in master branch, to call percentile_approx
>> instead of simpleUDF :
>>
>> - deprecated callUdf in SQLContext
>> - callUDF in SQLContext *** FAILED ***
>>   org.apache.spark.sql.AnalysisException: undefined function
>> percentile_approx;
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry$$anonfun$2.apply(FunctionRegistry.scala:64)
>>   at scala.Option.getOrElse(Option.scala:120)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:63)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
>>   at
>>
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
>>   at
>>
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>>
>> SPARK-10671 is included.
>> For 1.5.1, I guess the absence of SPARK-10671 means that SparkSQL treats
>> percentile_approx as normal UDF.
>>
>> Experts can correct me, if there is any misunderstanding.
>>
>> Cheers
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/callUdf-percentile-approx-col-mycol-lit-0-25-does-not-compile-spark-1-5-1-source-but-it-does-work-inn-tp25111.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Ted Yu
Interesting reading material.

bq. transformations that loose partitioner

lose partitioner

bq. Spark looses the partitioner

loses the partitioner

bq. Tunning number of partitions

Should be tuning.

bq. or increase shuffle fraction
bq. ShuffleMemoryManager: Thread 61 ...

Hopefully SPARK-1 would alleviate the above situation.

Cheers

On Sun, Oct 18, 2015 at 8:51 AM, igor.berman  wrote:

> Maybe somebody will find it useful:
> goo.gl/0yfvBd
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/our-spark-gotchas-report-while-creating-batch-pipeline-tp25112.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: REST api to avoid spark context creation

2015-10-18 Thread Raghavendra Pandey
You may want to look at the Spark JobServer:
https://github.com/spark-jobserver/spark-jobserver

Raghavendra


our spark gotchas report while creating batch pipeline

2015-10-18 Thread igor.berman
Maybe somebody will find it useful:
goo.gl/0yfvBd



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/our-spark-gotchas-report-while-creating-batch-pipeline-tp25112.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: No suitable Constructor found while compiling

2015-10-18 Thread Ted Yu
I see two argument ctor. e.g.

  /** Construct an RDD with just a one-to-one dependency on one parent */
  def this(@transient oneParent: RDD[_]) =
this(oneParent.context , List(new OneToOneDependency(oneParent)))

Looks like Tuple in your code is T in the following:

abstract class RDD[T: ClassTag](
@transient private var _sc: SparkContext,
@transient private var deps: Seq[Dependency[_]]

FYI
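
For reference, a minimal custom RDD in Scala showing the constructor shape quoted
above (a Java subclass has to pass the SparkContext, the Seq of dependencies, and the
ClassTag explicitly, since the ClassTag is implicit only in Scala; the class below is
illustrative):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // A trivial RDD with one partition and no parent dependencies.
    class QueryShard(sc: SparkContext, values: Seq[(String, Int)])
      extends RDD[(String, Int)](sc, Nil) {   // Nil is the empty Seq[Dependency[_]]

      override protected def getPartitions: Array[Partition] =
        Array[Partition](SinglePartition)

      override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
        values.iterator
    }

    // Defined at top level so it serializes without dragging the RDD along.
    case object SinglePartition extends Partition { override val index: Int = 0 }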

On Sun, Oct 18, 2015 at 6:39 AM, VJ Anand  wrote:

> I am trying to extend RDD in java, and when I call the parent constructor,
> it gives the error: no suitable constructor found for RDD (SparkContext,
> Seq, ClassTag).
>
>  Here is the snippet of the code:
>
> class QueryShard extends RDD {
>
> 
>
> sc (sc, (Seq)new ArrayBuffer,
> ClassTag$.MODULE$.apply(Tuple.class);
>
> }
>
> Compiler error: no suitable constructor found for RDD (SparkContext, Seq,
> ClassTag)
>
> Any thoughts? pointers..
>
> Thanks
> VJ
>
>
>


Pass spark partition explicitly ?

2015-10-18 Thread kali.tumm...@gmail.com
Hi All,

Can I pass the number of partitions for all the RDDs explicitly while submitting
the Spark job, or do I need to specify it in my Spark code itself?

Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Pass-spark-partition-explicitly-tp25113.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Pass spark partition explicitly ?

2015-10-18 Thread Richard Eggert
If you want to override the default partitioning behavior,  you have to do
so in your code where you create each RDD. Different RDDs usually have
different numbers of partitions (except when one RDD is directly derived
from another without shuffling) because they usually have different sizes,
so it wouldn't make sense to have some sort of "global" notion of how many
partitions to create.  You could,  if you wanted,  pass partition counts in
as command line options to your application and use those values in your
code that creates the RDDs, of course.
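
A minimal sketch of that command-line approach (argument handling, paths, and the job
itself are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionedJob {
      def main(args: Array[String]): Unit = {
        // e.g. spark-submit --class PartitionedJob app.jar 400
        val numPartitions = if (args.nonEmpty) args(0).toInt else 200
        val sc = new SparkContext(new SparkConf().setAppName("partitioned-job"))

        // Use the value wherever an RDD is created or shuffled.
        val lines  = sc.textFile("hdfs:///input", numPartitions)
        val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _, numPartitions)
        counts.saveAsTextFile("hdfs:///output")
      }
    }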

Rich
On Oct 18, 2015 1:57 PM, "kali.tumm...@gmail.com" 
wrote:

> Hi All,
>
> Can I pass the number of partitions for all the RDDs explicitly while submitting
> the Spark job, or do I need to specify it in my Spark code itself?
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Pass-spark-partition-explicitly-tp25113.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Pass spark partition explicitly ?

2015-10-18 Thread sri hari kali charan Tummala
Hi Richard,

Thanks. So my takeaway from your discussion is that if we want to pass partition
values explicitly, it has to be written inside the code.

Thanks
Sri

On Sun, Oct 18, 2015 at 7:05 PM, Richard Eggert 
wrote:

> If you want to override the default partitioning behavior,  you have to do
> so in your code where you create each RDD. Different RDDs usually have
> different numbers of partitions (except when one RDD is directly derived
> from another without shuffling) because they usually have different sizes,
> so it wouldn't make sense to have some sort of "global" notion of how many
> partitions to create.  You could,  if you wanted,  pass partition counts in
> as command line options to your application and use those values in your
> code that creates the RDDs, of course.
>
> Rich
> On Oct 18, 2015 1:57 PM, "kali.tumm...@gmail.com" 
> wrote:
>
>> Hi All,
>>
>> Can I pass the number of partitions for all the RDDs explicitly while submitting
>> the Spark job, or do I need to specify it in my Spark code itself?
>>
>> Thanks
>> Sri
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Pass-spark-partition-explicitly-tp25113.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


-- 
Thanks & Regards
Sri Tummala