[HELP:] Save Spark Dataframe in Phoenix Table

2016-04-07 Thread Divya Gehlot
Hi, I have a Hortonworks Hadoop cluster with the following configuration: Spark 1.5.2, HBase 1.1.x, Phoenix 4.4. I am able to connect to Phoenix through a JDBC connection and read the Phoenix tables, but while writing the data back to a Phoenix table I am getting the error below:
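
Without the full error it is hard to say more, but for reference, a minimal write sketch using the phoenix-spark plugin (rather than plain JDBC) looks roughly like this; the table name and ZooKeeper quorum are placeholders, and the plugin jar must be on the classpath:

    import org.apache.spark.sql.SaveMode

    // Hypothetical names; the phoenix-spark writer only supports Overwrite mode.
    df.write
      .format("org.apache.phoenix.spark")
      .mode(SaveMode.Overwrite)
      .option("table", "OUTPUT_TABLE")
      .option("zkUrl", "zkhost:2181")
      .save()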

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread ashesh_28
Hi, I am also attaching a screenshot of my ResourceManager UI, which shows the available cores and memory allocated for each node.

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread ashesh_28
Hi guys, thanks for your valuable inputs. I have tried a few alternatives as suggested, but they all lead me to the same result: unable to start the Spark context. @Dhiraj Peechara I am able to start my SC (SparkContext) in stand-alone mode by just issuing the *$spark-shell* command from the

Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Mich Talebzadeh
If you are using hiveContext to create a Hive database it will work. In general, you should use Hive to create the Hive database, and create tables within the already-existing database from Spark. Make sure that you qualify the table name with its database, as in sql("DROP TABLE IF EXISTS accounts.ll_18740868"). var sqltext :
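
A short sketch of that pattern (the column definitions are made up; the accounts database is assumed to already exist in Hive):

    // Qualify every table name with its database when running from Spark.
    sqlContext.sql("DROP TABLE IF EXISTS accounts.ll_18740868")
    sqlContext.sql(
      """CREATE TABLE accounts.ll_18740868 (id INT, amount DOUBLE)
        |STORED AS ORC""".stripMargin)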

MLlib ALS MatrixFactorizationModel.save fails consistently

2016-04-07 Thread Colin Woodbury
Hi all, I've implemented most of a content recommendation system for a client. However, whenever I attempt to save a MatrixFactorizationModel I've trained, I see one of four outcomes: 1. Despite "save" being wrapped in a "try" block, I see a massive stack trace quoting some java.io classes. The
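
For reference, a minimal save/load round trip against the MLlib API (the path and training parameters are placeholders, and `ratings` is an assumed RDD[Rating]); note that a stack trace printed in the logs is not necessarily an exception thrown in the driver, which may be why the "try" block does not catch it:

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    import scala.util.{Failure, Success, Try}

    val model = ALS.train(ratings, 10, 10)  // rank = 10, iterations = 10
    Try(model.save(sc, "hdfs:///tmp/als-model")) match {
      case Success(_) => println("model saved")
      case Failure(e) => e.printStackTrace()
    }
    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///tmp/als-model")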

Re: ordering over structs

2016-04-07 Thread Imran Akbar
thanks Michael, I'm trying to implement the code in pyspark like so (where my dataframe has 3 columns - customer_id, dt, and product): st = StructType().add("dt", DateType(), True).add("product", StringType(), True) top = data.select("customer_id", st.alias('vs')) .groupBy("customer_id")
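
The pattern from the thread is usually written with the struct() function, which builds a struct per row, rather than a StructType schema. A rough Scala sketch of the idea; pyspark exposes the same struct and max under pyspark.sql.functions:

    import org.apache.spark.sql.functions.{col, max, struct}

    // dt comes first inside the struct so max() orders by date, then
    // the winning product is pulled back out of the struct.
    val top = data
      .select(col("customer_id"), struct(col("dt"), col("product")).alias("vs"))
      .groupBy(col("customer_id"))
      .agg(max(col("vs")).alias("vs"))
      .select(col("customer_id"), col("vs.dt"), col("vs.product"))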

Re: Anyone have a tutorial or guide to implement Spark + AWS + Caffe/CUDA?

2016-04-07 Thread jamborta
Hi Alfredo, I have been building something similar and found that EMR is not suitable for this, as the GPU instances don't come with NVIDIA drivers (and the bootstrap process does not allow restarting instances). The way I'm setting it up is based on the spark-ec2 script, where you can use custom

Re: Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread Xiao Li
Hi, Assuming you are using 1.6 or before, this is a native Hive command. Basically, the database creation is executed by Hive. Thanks, Xiao Li 2016-04-07 15:23 GMT-07:00 antoniosi : > Hi, > > I am using hiveContext.sql("create database if not exists ") to

Is Hive CREATE DATABASE IF NOT EXISTS atomic

2016-04-07 Thread antoniosi
Hi, I am using hiveContext.sql("create database if not exists ") to create a Hive db. Is this statement atomic? Thanks. Antonio. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-Hive-CREATE-DATABASE-IF-NOT-EXISTS-atomic-tp26706.html

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-07 Thread Ted Yu
Which Spark release are you using? Have you registered for all the events provided by SparkListener? If so, can you do an event-wise summation of execution time? Thanks On Thu, Apr 7, 2016 at 11:03 AM, JasmineGeorge wrote: > We are running a batch job with the following
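
A rough sketch of that kind of event-wise accounting, assuming a SparkContext named sc; taskInfo.duration is in milliseconds:

    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Sum task run time across the application and compare it with
    // the wall-clock time of the batch.
    val totalTaskTime = new AtomicLong(0L)
    sc.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        totalTaskTime.addAndGet(taskEnd.taskInfo.duration)
      }
    })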

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-07 Thread JasmineGeorge
The logs are self-explanatory. It says "java.io.IOException: Incomplete HDFS URI, no host: hdfs:/user/hduser/share/lib/spark-assembly.jar". You need to specify the host in the above HDFS URL; it should look something like the following: hdfs://:8020/user/hduser/share/lib/spark-assembly.jar

Working with zips in pyspark

2016-04-07 Thread tminima
I have n zips in a directory and I want to extract each one of them, then get some data out of a file or two inside the zips and add it to a graph DB. All of my zips are in an HDFS directory. I am thinking my code should be along these lines. # Names of all my zips zip_names =
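
The question is pyspark, but the same idea sketched in Scala with binaryFiles plus java.util.zip (the path and the .json filter are assumptions; in Python, sc.binaryFiles plus the zipfile module plays the same role):

    import java.util.zip.ZipInputStream
    import scala.collection.mutable.ArrayBuffer

    // Each element is (zip path, entry contents); ZipInputStream.read
    // stops at the end of the current entry, so mkString is per entry.
    val entries = sc.binaryFiles("hdfs:///data/zips/*.zip").flatMap {
      case (path, stream) =>
        val zis = new ZipInputStream(stream.open())
        val out = ArrayBuffer[(String, String)]()
        var entry = zis.getNextEntry
        while (entry != null) {
          if (entry.getName.endsWith(".json")) {
            out += ((path, scala.io.Source.fromInputStream(zis).mkString))
          }
          entry = zis.getNextEntry
        }
        zis.close()
        out
    }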

Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Nirmal Manoharan
Hi Greg, I use something similar to this in my application, but not for empty strings, so the example below is untested; it should work, though. JavaRDD<String> filteredJavaRDD = example.filter(new Function<String, Boolean>() { public Boolean call(String arg0) throws Exception { return !arg0.equals(""); } });

RE: mapWithState not compacting removed state

2016-04-07 Thread Iain Cundy
Hi Ofir, I've discovered that compaction works in 1.6.0 if I switch off Kryo. I was using a workaround to get around mapWithState not supporting Kryo; see https://issues.apache.org/jira/browse/SPARK-12591. My custom KryoRegistrator Java class has // workaround until bug fixes in spark 1.6.1
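
In its simplest form, "switching off Kryo" just means reverting to the default Java serializer before the StreamingContext is created; a sketch:

    import org.apache.spark.SparkConf

    // Keep mapWithState's internal state map away from Kryo (cf. SPARK-12591).
    val conf = new SparkConf()
      .setAppName("map-with-state-app")
      .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")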

Re: HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Nick Pentreath
You're right Sean, the implementation currently depends on hash code, so it may differ. I opened a JIRA (which duplicated this one - https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574, which is the active JIRA) for using MurmurHash3, which should then be consistent across platforms

Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Ted Yu
This is the version of Kafka Spark depends on: [INFO] +- org.apache.kafka:kafka_2.10:jar:0.8.2.1:compile On Thu, Apr 7, 2016 at 9:14 AM, Haroon Rasheed wrote: > Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0" > compile. I guess the internal

Re: building kafka project on intellij Help is much appreciated

2016-04-07 Thread Haroon Rasheed
Try removing libraryDependencies += "org.apache.kafka" %% "kafka" % "1.6.0" compile. I guess the internal dependencies are automatically pulled in when you add spark-streaming-kafka_2.10. Also try changing the version to 1.6.1 or lower, just to see if the links are broken. Regards, Haroon Syed On

building kafka project on intellij Help is much appreciated

2016-04-07 Thread Sudhanshu Janghel
Hello, I am new to building Kafka projects and wish to understand how to make fat jars in IntelliJ. The sbt assembly setup seems confusing and I am unable to resolve the dependencies. Here is my build.sbt: name := "twitter" version := "1.0" scalaVersion := "2.10.4" //libraryDependencies += "org.slf4j" %
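
For reference, a minimal build.sbt sketch for a fat jar with sbt-assembly, assuming Spark 1.6.x on Scala 2.10 and addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3") in project/plugins.sbt; marking the Spark artifacts "provided" keeps them out of the assembly:

    name := "twitter"
    version := "1.0"
    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.6.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.6.1" % "provided",
      // pulls in the matching Kafka (0.8.2.1) transitively
      "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1"
    )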

Re: Dataframe to parquet using hdfs or parquet block size

2016-04-07 Thread Buntu Dev
I tried setting both the HDFS and parquet block sizes, but the write to parquet did not seem to have any effect on the total number of blocks or the average block size. Here is what I did: sqlContext.setConf("dfs.blocksize", "134217728") sqlContext.setConf("parquet.block.size", "134217728")
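
One possible explanation, offered as an assumption: both keys are Hadoop configuration settings rather than Spark SQL conf entries, so they may need to be set on the Hadoop configuration instead. A sketch:

    // Set the keys on the Hadoop configuration the parquet writer reads.
    sc.hadoopConfiguration.setInt("dfs.blocksize", 134217728)
    sc.hadoopConfiguration.setInt("parquet.block.size", 134217728)
    df.write.parquet("hdfs:///tmp/out")  // placeholder path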

Re: mapWithState not compacting removed state

2016-04-07 Thread Ofir Kerker
Hi Iain, Did you manage to solve this issue? It looks like we have a similar issue with processing time increasing every micro-batch but only after 30 batches. Thanks. On Thu, Mar 3, 2016 at 4:45 PM Iain Cundy wrote: > Hi All > > > > I’m aggregating data using

HashingTF "compatibility" across Python, Scala?

2016-04-07 Thread Sean Owen
Let's say I use HashingTF in my Pipeline to hash a string feature. This is available in Python and Scala, but they hash strings to different values since both use their respective runtime's native hash implementation. This means that I create different feature vectors for the same input. While I
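
A small sketch of the mismatch on the Scala side (numFeatures is arbitrary):

    import org.apache.spark.mllib.feature.HashingTF

    // The index comes from the JVM String.hashCode modulo numFeatures,
    // which generally differs from Python's hash("spark") % (1 << 20).
    val tf = new HashingTF(1 << 20)
    println(tf.indexOf("spark"))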

Re: Spark on Mobile platforms

2016-04-07 Thread Luciano Resende
Take a look at Apache Quarks; it is closer to what you are looking for and has the ability to integrate with Spark. http://quarks.apache.org/ On Thu, Apr 7, 2016 at 4:50 AM, sakilQUB wrote: > Hi all, > > I have been trying to find if Spark can be run on a mobile

Re: How to process one partition at a time?

2016-04-07 Thread Andrei
Thanks everyone, both - `submitJob` and `PartitionPruningRDD` - work for me. On Thu, Apr 7, 2016 at 8:22 AM, Hemant Bhanawat wrote: > Apparently, there is another way to do it. You can try creating a > PartitionPruningRDD and pass a partition filter function to it. This
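
For reference, a PartitionPruningRDD sketch (the parent `rdd` and the choice of partition 3 are placeholders):

    import org.apache.spark.rdd.PartitionPruningRDD

    // Downstream actions touch only the partitions the filter keeps.
    val onePartition = PartitionPruningRDD.create(rdd, partitionId => partitionId == 3)
    onePartition.collect()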

Re: Spark on Mobile platforms

2016-04-07 Thread Michael Slavitch
You should consider mobile agents that feed data into a Spark datacenter via Spark Streaming. > On Apr 7, 2016, at 8:28 AM, Ashic Mahtab wrote: > > Spark may not be the right tool for this. Working on just the mobile device, > you won't be scaling out stuff, and as such most

Re: How to remove empty strings from JavaRDD

2016-04-07 Thread Chris Miller
flatMap? -- Chris Miller On Thu, Apr 7, 2016 at 10:25 PM, greg huang wrote: > Hi All, > > Can someone give me example code to get rid of the empty strings in a > JavaRDD? I know there is a filter method in JavaRDD: >

How to remove empty strings from JavaRDD

2016-04-07 Thread greg huang
Hi All, Can someone give me some example code to get rid of the empty strings in a JavaRDD? I know there is a filter method in JavaRDD: https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/rdd/RDD.html#filter(scala.Function1) Regards, Greg

RE: Spark on Mobile platforms

2016-04-07 Thread Ashic Mahtab
Spark may not be the right tool for this. Working on just the mobile device, you won't be scaling out stuff, and as such most of the benefits of Spark would be nullified. Moreover, it'd likely run slower than things that are meant to work in a single process. Spark is also quite large, which is

difference between simple streaming and windows streaming in spark

2016-04-07 Thread Ashok Kumar
Does simple streaming mean continuous streaming, and does windowed streaming mean a time window? val ssc = new StreamingContext(sparkConf, Seconds(10)) thanks
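
Roughly, yes: the batch interval chops the stream into fixed batches, while window() layers a sliding time window on top of them. A sketch (the socket source is a placeholder):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Every DStream is already computed in 10-second batches.
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each windowed computation sees the last 30 seconds of data,
    // re-evaluated every 10 seconds.
    val windowed = lines.window(Seconds(30), Seconds(10))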

Spark on Mobile platforms

2016-04-07 Thread sakilQUB
Hi all, I have been trying to find if Spark can be run on a mobile device platform (Android preferably) to analyse mobile log data for some performance analysis. So, basically the idea is to collect and process the mobile log data within the mobile device using the Spark framework to allow

Re: partition an empty RDD

2016-04-07 Thread Tenghuan He
Thanks for your response, Owen :) Yes, I defined K as a ClassTag type and it works. Sorry for bothering you. On Thu, Apr 7, 2016 at 4:07 PM, Sean Owen wrote: > It means pretty much what it says. Your code does not have runtime > class info about K at this point in your code, and it
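
A sketch of the fix the thread converged on, with ClassTag bounds so partitionBy resolves (the names are made up):

    import scala.reflect.ClassTag
    import org.apache.spark.HashPartitioner

    // Without the ClassTag bounds, the pair-RDD operations cannot be
    // resolved for the abstract K and V.
    def emptyPartitioned[K: ClassTag, V: ClassTag](numParts: Int) =
      sc.emptyRDD[(K, V)].partitionBy(new HashPartitioner(numParts))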

Develop locally with Yarn

2016-04-07 Thread Natu Lauchande
Hi, I am working on a Spark Streaming app; locally, I use "local[*]" as the master of my Spark Streaming context. I wonder what would be needed to develop locally and run it on YARN through the IDE. I am using IntelliJ IDEA. Thanks, Natu
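
A hedged sketch of one way this is commonly attempted on Spark 1.x (the assembly-jar location is a placeholder; HADOOP_CONF_DIR/YARN_CONF_DIR must be on the IDE run configuration's classpath for the master to be found):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("streaming-dev")
      .setMaster("yarn-client")  // instead of local[*]
      .set("spark.yarn.jar", "hdfs://namenode:8020/user/spark/spark-assembly.jar")
    val sc = new SparkContext(conf)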

Re: LabeledPoint with features in matrix form (word2vec matrix)

2016-04-07 Thread jamborta
It depends. If you'd like to multiply matrices for each row in the data, then you could use a Breeze matrix and do that locally on the nodes, in a map or similar. If you'd like to multiply them across the rows, e.g. a row in your data is a row in the matrix, then you could use a distributed matrix
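
A sketch of the first option, multiplying each row locally against a broadcast Breeze matrix (`rows` is an assumed RDD[DenseVector[Double]] and the dimensions are made up):

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Ship one small weight matrix to the executors and apply it per row.
    val w = DenseMatrix.rand(10, 300)
    val wBc = sc.broadcast(w)
    val projected = rows.map(v => wBc.value * v)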

Re: partition an empty RDD

2016-04-07 Thread Sean Owen
It means pretty much what it says: at this point your code does not have runtime class info about K, and it is required. On Thu, Apr 7, 2016 at 5:52 AM, Tenghuan He wrote: > Hi all, > > I want to create an empty rdd and partition it > > val buffer: RDD[(K, (V,

Dataframe to parquet using hdfs or parquet block size

2016-04-07 Thread bdev
I need to save the dataframe in parquet format and need some input on choosing the appropriate block size to help efficiently parallelize/localize the data to the executors. Should I be using the parquet block size or the HDFS block size, and what is the optimal block size to use on a 100-node cluster?