Re: Verifying number of workers in Spark Streaming

2015-06-21 Thread Silvio Fiorito
If you look at your streaming app UI you should see how many tasks are executed each batch and on how many executors. This is dependent on the batch duration and block interval, which defaults to 200ms. So every block interval a partition will be generated. You can control the parallelism by
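The knobs mentioned above can be sketched like this (a minimal sketch, not from the thread; the host/port and interval values are placeholder assumptions):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 2s batch and the default 200ms block interval, each receiver
// produces up to 2000 / 200 = 10 blocks (= 10 partitions) per batch.
val conf = new SparkConf()
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "100ms") // 2000 / 100 = 20 partitions per batch
val ssc = new StreamingContext(conf, Seconds(2))

val lines = ssc.socketTextStream("host", 9999) // placeholder source
// Alternatively, explicitly repartition each batch's RDDs:
val reparted = lines.repartition(20)
```

Lowering the block interval raises receiver-side parallelism; `repartition` does it after the fact at the cost of a shuffle.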

Spark Titan

2015-06-21 Thread Madabhattula Rajesh Kumar
Hi, How do I connect to the Titan database from Spark? Are any out-of-the-box APIs available? Regards, Rajesh

Re: Spark Titan

2015-06-21 Thread Akhil Das
Have a look at http://s3.thinkaurelius.com/docs/titan/0.5.0/titan-io-format.html You could use those Input/Output formats with newAPIHadoopRDD api call. Thanks Best Regards On Sun, Jun 21, 2015 at 8:50 PM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi, How to connect TItan

Re: Spark 1.4 History Server - HDP 2.2

2015-06-21 Thread Steve Loughran
On 20 Jun 2015, at 17:37, Ashish Soni asoni.le...@gmail.com wrote: Can anyone help? I am getting the below error when I try to start the History Server. I do not see any org.apache.spark.deploy.yarn.history package inside the assembly jar; not sure how to get that

Fwd: Java Constructor Issues

2015-06-21 Thread Shaanan Cohney
Hi all, I'm having an issue running some code that works on a build of Spark I made (and still have), but when I rebuild it I get the traceback below. I built it from the 1.4.0 release with the hadoop-2.4 profile but Hadoop version 2.7, and I'm using Python 3. It's not vital to my work (as I can use my

Re: Java Constructor Issues

2015-06-21 Thread Davies Liu
The compiled jar is not consistent with the Python source; maybe you are using an older version of PySpark with the assembly jar of Spark Core 1.4? On Sun, Jun 21, 2015 at 7:24 AM, Shaanan Cohney shaan...@gmail.com wrote: Hi all, I'm having an issue running some code that works on a build of spark

Re: Using Accumulators in Streaming

2015-06-21 Thread Michal Čizmazia
StreamingContext.sparkContext() On 21 June 2015 at 21:32, Will Briggs wrbri...@gmail.com wrote: It sounds like accumulators are not necessary in Spark Streaming - see this post ( http://apache-spark-user-list.1001560.n3.nabble.com/Shared-variable-in-Spark-Streaming-td11762.html) for more
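The suggestion above — reuse the StreamingContext's underlying SparkContext rather than creating a second context — can be sketched as follows (a hedged sketch; the socket source and accumulator name are placeholders, not from the thread):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingAccumulator")
val ssc = new StreamingContext(conf, Seconds(1))

// No second SparkContext needed: the StreamingContext exposes its own.
val recordCount = ssc.sparkContext.accumulator(0L, "records")

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.foreachRDD { rdd =>
  rdd.foreach(_ => recordCount += 1L) // incremented on executors
  println(s"seen so far: ${recordCount.value}") // value readable on the driver only
}
```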

PartitionBy/Partitioner for dataFrames?

2015-06-21 Thread Tom Hubregtsen
Hi, I am trying to rewrite my program to use dataFrames, and I see that I can perform a mapPartitions and a foreachPartition, but can I perform a partitionBy/set a partitioner? Or is there some other way to make my data land in the right partition for *Partition to use? (I see that PartitionBy is
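One workaround, assuming Spark 1.x where DataFrames expose no partitioner API: drop to the underlying RDD, partition by a key column, and rebuild the DataFrame with the original schema. A hedged sketch (the helper name and hash-partitioning choice are mine, not from the thread):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.{DataFrame, SQLContext}

def partitionByColumn(sqlContext: SQLContext, df: DataFrame,
                      keyCol: String, numPartitions: Int): DataFrame = {
  val schema = df.schema
  val keyIdx = schema.fieldNames.indexOf(keyCol)
  val partitioned = df.rdd
    .keyBy(row => row.get(keyIdx))                  // key by the chosen column
    .partitionBy(new HashPartitioner(numPartitions)) // classic RDD partitioner
    .values                                          // drop the key again
  sqlContext.createDataFrame(partitioned, schema)
}
```

This loses DataFrame optimizations for the partitioning step, but the result lands in deterministic partitions for the *Partition operations to use.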

[Spark 1.3.1 SQL] Using Hive

2015-06-21 Thread Mike Frampton
Hi, Is it true that if I want to use Spark SQL (for Spark 1.3.1) against Apache Hive I need to build a source version of Spark? I'm using CDH 5.3 on CentOS Linux 6.5, which uses Hive 0.13.0 (I think). cheers Mike F

Re: Task Serialization Error on DataFrame.foreachPartition

2015-06-21 Thread Ted Yu
Can you show us the code for loading from Hive into HBase? There shouldn't be a 'return' statement in that code. Cheers On Jun 20, 2015, at 10:10 PM, Nishant Patel nishant.k.pa...@gmail.com wrote: Hi, I am loading data from a Hive table to HBase after doing some manipulation. I am getting

Problem attaching to YARN

2015-06-21 Thread Shawn Garbett
I've spent the last 3 days trying to get a connection to YARN from Spark on a single box to work through examples. I'm at a loss. It's a dual-core box running Debian Jessie. I've tried both Java 7 and Java 8 from Oracle. It has Hadoop 2.7 installed and YARN running. Scala version 2.10.4, and

Re: Using Accumulators in Streaming

2015-06-21 Thread Will Briggs
It sounds like accumulators are not necessary in Spark Streaming - see this post ( http://apache-spark-user-list.1001560.n3.nabble.com/Shared-variable-in-Spark-Streaming-td11762.html) for more details. On June 21, 2015, at 7:31 PM, anshu shukla anshushuk...@gmail.com wrote: In spark Streaming

Spark 1.4.0 SQL JDBC partition stride?

2015-06-21 Thread Keith Freeman
The spark docs section for JDBC to Other Databases (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) describes the partitioning as ... Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in
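For reference, the partitioned read the docs describe looks like this in 1.4 (a hedged sketch; the URL, table, and column names are made-up placeholders). Rows outside the bounds are not filtered out — they simply fall into the first or last partition's WHERE clause:

```scala
import java.util.Properties
import org.apache.spark.sql.SQLContext

def readPartitioned(sqlContext: SQLContext) =
  sqlContext.read.jdbc(
    url = "jdbc:postgresql://dbhost/mydb", // placeholder
    table = "events",                      // placeholder
    columnName = "id",      // numeric column used to split the table
    lowerBound = 0L,
    upperBound = 1000000L,
    numPartitions = 10,     // stride = (1000000 - 0) / 10 = 100000 per partition
    connectionProperties = new Properties())
```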

Reducer memory usage

2015-06-21 Thread Corey Nolet
I've seen a few places where it's been mentioned that after a shuffle each reducer needs to pull its partition into memory in its entirety. Is this true? I'd assume the merge sort that needs to be done (in the cases where sortByKey() is not used) wouldn't need to pull all of the data into memory

How to use an different version of hive

2015-06-21 Thread Sea
Hi, all: We have our own version of Hive 0.13.1; we altered the code around table-operation permissions and a Hive 0.13.1 issue, HIVE-6131. Spark 1.4.0 supports different versions of the Hive metastore — who can give an example? I am confused by these spark.sql.hive.metastore.jars
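A configuration sketch for pointing Spark 1.4 at an external metastore client (the jar paths below are placeholder assumptions, not from the thread — substitute your own Hive 0.13.1 and Hadoop client directories), e.g. in spark-defaults.conf:

```
spark.sql.hive.metastore.version  0.13.1
spark.sql.hive.metastore.jars     /opt/hive-0.13.1/lib/*:/opt/hadoop/client/*
```

With `spark.sql.hive.metastore.jars` left at its default (`builtin`), Spark uses its bundled Hive classes instead of your patched ones.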

memory needed for each executor

2015-06-21 Thread pth001
Hi, How can I know the amount of memory needed for each executor (one core) to execute each job? If there are many cores per executor, will the memory be the product (memory needed per one-core executor * no. of cores)? Any suggestions/guidelines? BR, Patcharee

Re: About Jobs UI in yarn-client mode

2015-06-21 Thread Sea
Thanks, it is OK now. ------ Original ------ From: Gavin Yue yue.yuany...@gmail.com Date: 2015-06-21 4:40 To: Sea 261810...@qq.com Cc: user user@spark.apache.org Subject: Re: About Jobs UI in yarn-client mode I got the same problem when

Fwd: How to get and parse whole xml file in HDFS by Spark Streaming

2015-06-21 Thread Yong Feng
Hi Spark Experts, I have a customer who wants to monitor incoming data files (in XML format), analyze them, and then put the analyzed data into a DB. The size of each file is about 30MB (or even less in future). Spark Streaming seems promising. After learning Spark Streaming and also
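One approach for whole-file XML (a hedged sketch, not from the thread): the line-oriented stream sources split files apart, but `wholeTextFiles` keeps each file intact as a single (path, content) pair. The directory and the "record" element name below are made-up placeholders:

```scala
import org.apache.spark.SparkContext
import scala.xml.XML

// Parse each XML file in a directory as one whole document.
def parseXmlDir(sc: SparkContext, dir: String) =
  sc.wholeTextFiles(dir).map { case (path, content) =>
    val doc = XML.loadString(content)   // one parsed document per file
    (path, (doc \\ "record").length)    // "record" is a hypothetical element name
  }
```

At ~30MB per file this stays comfortably within a task, and the batch job can be re-run on a schedule in place of a streaming receiver.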

Re: Velox Model Server

2015-06-21 Thread Sean Owen
Out of curiosity why netty? What model are you serving? Velox doesn't look like it is optimized for cases like ALS recs, if that's what you mean. I think scoring ALS at scale in real time takes a fairly different approach. The servlet engine probably doesn't matter at all in comparison. On Sat,

Updation of Static variable inside foreachRDD method

2015-06-21 Thread anshu shukla
I want to log the timestamp of every element of the RDD, so I have assigned a MSGid to every element inside the RDD and incremented it (a static variable). My code gives distinct MSGids in local mode, but in cluster mode a value is duplicated every 30-40 counts. Please help !! //public static long
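The duplication happens because a driver-side static counter is serialized into each executor's closure, so every executor increments its own copy. A hedged alternative sketch (the helper name is mine): let Spark assign the IDs.

```scala
import org.apache.spark.rdd.RDD

// Tag each element with a cluster-unique ID instead of a shared counter.
def tagWithIds[T](rdd: RDD[T]): RDD[(Long, T)] =
  rdd.zipWithUniqueId().map { case (elem, id) => (id, elem) }

// zipWithIndex() gives consecutive IDs at the cost of an extra Spark job;
// zipWithUniqueId() is cheaper but leaves gaps between IDs.
```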

s3 - Can't make directory for path

2015-06-21 Thread nizang
hi, I'm trying to setup a standalone server, and in one of my tests, I got the following exception: java.io.IOException: Can't make directory for path 's3n://ww-sandbox/name_of_path' since it is a file. at

Using Accumulators in Streaming

2015-06-21 Thread anshu shukla
In Spark Streaming, since we already have a StreamingContext, which does not let us create accumulators, we have to get a SparkContext to initialize the accumulator value. But having two Spark contexts will not solve the problem. Please Help !! -- Thanks Regards, Anshu Shukla

Re: Spark Titan

2015-06-21 Thread Nick Pentreath
Something like this works (or at least worked with titan 0.4 back when I was using it): val graph = sc.newAPIHadoopRDD( configuration, fClass = classOf[TitanHBaseInputFormat], kClass = classOf[NullWritable], vClass = classOf[FaunusVertex]) graph.flatMap { vertex =

Re: Velox Model Server

2015-06-21 Thread Stephen Boesch
Oryx 2 has a scala client https://github.com/OryxProject/oryx/blob/master/framework/oryx-api/src/main/scala/com/cloudera/oryx/api/ 2015-06-20 11:39 GMT-07:00 Debasish Das debasish.da...@gmail.com: After getting used to Scala, writing Java is too much work :-) I am looking for scala based

Driver and Executor on the same machine

2015-06-21 Thread DStrip
I am using Spark Standalone mode to deploy a Spark cluster (Spark v.1.2.1) on 3 machines. The cluster is as follows: Machine A: Spark Master + Spark Worker Machine B: Spark Worker Machine C: Spark Worker I start the spark driver from the /bin/spark-shell of the Machine A. Even though for the

Re: Velox Model Server

2015-06-21 Thread Nick Pentreath
Is there a presentation up about this end-to-end example? I'm looking into velox now - our internal model pipeline just saves factors to S3 and model server loads them periodically from S3 — Sent from Mailbox On Sat, Jun 20, 2015 at 9:46 PM, Debasish Das debasish.da...@gmail.com wrote: