Re: Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-20 Thread Jean-Pascal Billaud
Hey, so I start the context at the very end when all the piping is done. BTW a foreachRDD will be called on the resulting dstream.map() right after that. The puzzling thing is why removing the context bounds solves the problem... What does this exception mean in general? On Mon, Apr 20, 2015 at

Why is Columnar Parquet used as default for saving Row-based DataFrames/RDD?

2015-04-20 Thread Duy Lan Nguyen
Hello, I have the above naive question if anyone could help. Why not use a Row-based file format to save Row-based DataFrames/RDDs? Thanks, Lan

Re: Streaming Linear Regression problem

2015-04-20 Thread Xiangrui Meng
Did you keep adding new files under the `train/` folder? What was the exact warning message? -Xiangrui On Fri, Apr 17, 2015 at 4:56 AM, barisak baris.akg...@gmail.com wrote: Hi, I wrote this code just to train the streaming linear regression, but I got a 'no data found' warning, so no weights were not

ChiSquared Test from user response flat files to RDD[Vector]?

2015-04-20 Thread Dan DeCapria, CivicScience
Hi Spark community, I'm very new to the Apache Spark community; but if this (very active) group is anything like the other Apache project user groups I've worked with, I'm going to enjoy discussions here very much. Thanks in advance! Use Case: I am trying to go from flat files of user response

RE: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster

2015-04-20 Thread Andrew Lee
Hi Marcelo, Exactly what I need to track, thanks for the JIRA pointer. Date: Mon, 20 Apr 2015 14:03:55 -0700 Subject: Re: GSSException when submitting Spark job in yarn-cluster mode with HiveContext APIs on Kerberos cluster From: van...@cloudera.com To: alee...@hotmail.com CC:

Re: SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Does anybody have an idea? a clue? a hint? Thanks! Renato M. 2015-04-20 9:31 GMT+02:00 Renato Marroquín Mogrovejo renatoj.marroq...@gmail.com: Hi all, I have a simple query Select * from tableX where attribute1 between 0 and 5 that I run over a Kryo file with four partitions that ends up

Re: SparkSQL performance

2015-04-20 Thread ayan guha
SparkSQL optimizes better primarily by column pruning and predicate pushdown. Here you are not taking advantage of either. I am curious to know what goes on in your filter function, as you are not using a filter on the SQL side. Best Ayan On 21 Apr 2015 08:05, Renato Marroquín Mogrovejo

Re: ChiSquared Test from user response flat files to RDD[Vector]?

2015-04-20 Thread Xiangrui Meng
You can find the user guide for vector creation here: http://spark.apache.org/docs/latest/mllib-data-types.html#local-vector. -Xiangrui On Mon, Apr 20, 2015 at 2:32 PM, Dan DeCapria, CivicScience dan.decap...@civicscience.com wrote: Hi Spark community, I'm very new to the Apache Spark

Re: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-20 Thread Tathagata Das
Responses inline. On Mon, Apr 20, 2015 at 3:27 PM, Ankit Patel patel7...@hotmail.com wrote: What you said is correct and I am expecting the printlns to be in my console or my SparkUI. I do not see it in either places. Can you actually login into the machine running the executor which runs the

HiveContext vs SQLContext

2015-04-20 Thread Daniel Mahler
Is HiveContext still preferred over SQLContext? What are the current (1.3.1) diferences between them? thanks Daniel

Re: MLlib - Naive Bayes Problem

2015-04-20 Thread Xiangrui Meng
Could you attach the full stack trace? Please also include the stack trace from executors, which you can find on the Spark WebUI. -Xiangrui On Thu, Apr 16, 2015 at 1:00 PM, riginos samarasrigi...@gmail.com wrote: I have a big dataset of categories of cars and descriptions of cars. So i want to

Instantiating/starting Spark jobs programmatically

2015-04-20 Thread Ajay Singal
Greetings, We have an analytics workflow system in production. This system is built in Java and utilizes other services (including Apache Solr). It works fine with moderate level of data/processing load. However, when the load goes beyond certain limit (e.g., more than 10 million

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Ted Yu
Please take a look at https://issues.apache.org/jira/browse/PHOENIX-1815 On Mon, Apr 20, 2015 at 10:11 AM, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for the reply. Will using Phoenix inside Spark be useful? What is the best way to bring data from HBase into Spark in terms

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread ayan guha
I think the recommended use will be creating a DataFrame using HBase as a source. Then you can run any SQL on that DF. In 1.2 you can create a base RDD and then apply a schema in the same manner. On 21 Apr 2015 03:12, Jeetendra Gangele gangele...@gmail.com wrote: Thanks for reply. Does phoenix using

Re: SparkSQL performance

2015-04-20 Thread Michael Armbrust
There is a cost to converting from JavaBeans to Rows and this code path has not been optimized. That is likely what you are seeing. On Mon, Apr 20, 2015 at 3:55 PM, ayan guha guha.a...@gmail.com wrote: SparkSQL optimizes better by column pruning and predicate pushdown, primarily. Here you are

Re: spark sql error with proto/parquet

2015-04-20 Thread Michael Armbrust
You are probably using an encoding that we don't support. I think this PR may be adding that support: https://github.com/apache/spark/pull/5422 On Sat, Apr 18, 2015 at 5:40 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: I have created a bunch of protobuf based parquet files that

Re: Updating a Column in a DataFrame

2015-04-20 Thread ayan guha
You can always create another DF using a map. In reality, operations are lazy, so only the final value will get computed. Can you provide the use case in a little more detail? On 21 Apr 2015 08:39, ARose ashley.r...@telarix.com wrote: In my Java application, I want to update the values of a Column in a

Why is Columnar Parquet used as default for saving Row-based DataFrames/RDD?

2015-04-20 Thread Lan
Hello, I have the above naive question if anyone could help. Why not use a Row-based file format to save Row-based DataFrames/RDDs? Thanks, Lan -- View this message in context:

Re: how to make a spark cluster ?

2015-04-20 Thread haihar nahak
Thank you :) On Mon, Apr 20, 2015 at 4:46 PM, Jörn Franke jornfra...@gmail.com wrote: Hi, If you have just one physical machine then I would try out Docker instead of a full VM (would be waste of memory and CPU). Best regards Le 20 avr. 2015 00:11, hnahak harihar1...@gmail.com a écrit :

RE: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-20 Thread Ankit Patel
What you said is correct and I am expecting the printlns to be in my console or my SparkUI. I do not see it in either places. However, if you run the program then the printlns do print for the constructor of the receiver and the for the foreach statements with total count 0. When you run it in

Updating a Column in a DataFrame

2015-04-20 Thread ARose
In my Java application, I want to update the values of a Column in a given DataFrame. However, I realize DataFrames are immutable, and therefore cannot be updated by conventional means. Is there a workaround for this sort of transformation? If so, can someone provide an example? -- View this

Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-04-20 Thread Xiangrui Meng
You should check where MyDenseVectorUDT is defined and whether it was on the classpath (or in the assembly jar) at runtime. Make sure the full class name (with package name) is used. Btw, UDTs are not public yet, so please use it with caution. -Xiangrui On Fri, Apr 17, 2015 at 12:45 AM, Jaonary

Re: Spark Performance on Yarn

2015-04-20 Thread Peng Cheng
I got exactly the same problem, except that I'm running on a standalone master. Can you tell me the counterpart parameter on standalone master for increasing the same memory overhead? -- View this message in context:

MLlib - Collaborative Filtering - trainImplicit task size

2015-04-20 Thread Christian S. Perone
I keep seeing these warnings when using trainImplicit: WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB. And then the task size starts to increase. Is this a known issue ? Thanks ! -- Blog http://blog.christianperone.com |

meet weird exception when studying rdd caching

2015-04-20 Thread donhoff_h
Hi, I am studying the RDD caching function and wrote a small program to verify it. I run the program in a Spark 1.3.0 environment on a YARN cluster. But I hit a weird exception. It isn't always generated in the log; only sometimes can I see this exception. And it does not affect the output

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-20 Thread ๏̯͡๏
It also is a little more evidence to Jonathan's suggestion that there is a null / 0 record that is getting grouped together. To fix this, do I need to run a filter? val viEventsRaw = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) } val viEvents = viEventsRaw.filter { case

Re: Map-Side Join in Spark

2015-04-20 Thread ayan guha
In my understanding you need to create a key out of the data and repartition both datasets to achieve map side join. On 21 Apr 2015 14:10, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Can someone share their working code of Map Side join in Spark + Scala. (No Spark-SQL) The only resource i could

Spark and accumulo

2015-04-20 Thread madhvi
Hi all, Is there anything to integrate Spark with Accumulo or make Spark process data stored in Accumulo? Thanks Madhvi Gupta - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail:

Re: Instantiating/starting Spark jobs programmatically

2015-04-20 Thread firemonk9
I have built a data analytics SaaS platform by creating REST endpoints, and based on the type of job request I would invoke the necessary Spark job/jobs and return the results as JSON (async). I used yarn-client mode to submit the jobs to the YARN cluster. Hope this helps. -- View this

Re: Equi Join is taking for ever. 1 Task is Running while other 199 are complete

2015-04-20 Thread ๏̯͡๏
After the above changes: 1) the filter stage shows this task row (columns: Index, ID, Attempt, Status, Locality Level, Executor ID / Host, Launch Time, Duration, GC Time, Input Size / Records, Write Time, Shuffle Write Size / Records, Errors): 0, 1, 0, SUCCESS, ANY, 1 / phxaishdc9dn1571.stratus.phx.ebay.com, 2015/04/20 20:55:31, 7.4 min, 21 s, 129.7

Map-Side Join in Spark

2015-04-20 Thread ๏̯͡๏
Can someone share their working code of Map Side join in Spark + Scala. (No Spark-SQL) The only resource i could find was this (Open in chrome with Chinese to english translator) http://dongxicheng.org/framework-on-yarn/apache-spark-join-two-tables/ -- Deepak

Re: Map-Side Join in Spark

2015-04-20 Thread Punyashloka Biswal
Could you do it using flatMap? Punya On Tue, Apr 21, 2015 at 12:19 AM ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: The reason am asking this is, i am not able to understand how do i do a skip. 1) Broadcast small table-1 as map. 2) I jun do .map() on large table-2. When you do .map()

Re: Map-Side Join in Spark

2015-04-20 Thread ๏̯͡๏
I did this val lstgItemMap = listings.map { lstg => (lstg.getItemId().toLong, lstg) }.collectAsMap val broadCastMap = sc.broadcast(lstgItemMap) val viEventsWithListings: RDD[(Long, (DetailInputRecord, VISummary, Long))] = viEvents.mapPartitions({ iter => val lstgItemMap =

Re: invalidate caching for hadoopFile input?

2015-04-20 Thread ayan guha
You can use rdd.unpersist. It's documented in the Spark programming guide under the Removing Data section. Ayan On 21 Apr 2015 13:16, Wei Wei vivie...@gmail.com wrote: Hey folks, I am trying to load a directory of avro files like this in spark-shell: val data =

invalidate caching for hadoopFile input?

2015-04-20 Thread Wei Wei
Hey folks, I am trying to load a directory of avro files like this in spark-shell: val data = sqlContext.avroFile(hdfs://path/to/dir/*).cache data.count This works fine, but when more files are uploaded to that directory running these two lines again yields the same result. I suspect there is

Re: Map-Side Join in Spark

2015-04-20 Thread ๏̯͡๏
The reason I am asking this is that I am not able to understand how to do a skip. 1) Broadcast small table-1 as a map. 2) I just do .map() on large table-2. When you do .map() you must map each element to a new element. However with a map-side join, when I get the broadcast map, I will search in

Re: Map-Side Join in Spark

2015-04-20 Thread ๏̯͡๏
What is re-partition ? On Tue, Apr 21, 2015 at 10:23 AM, ayan guha guha.a...@gmail.com wrote: In my understanding you need to create a key out of the data and repartition both datasets to achieve map side join. On 21 Apr 2015 14:10, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote: Can someone share

Re: Streaming problems running 24x7

2015-04-20 Thread Luis Ángel Vicente Sánchez
You have a window operation; I have seen that behaviour before with window operations in spark streaming. My solution was to move away from window operations using probabilistic data structures; it might not be an option for you. 2015-04-20 10:29 GMT+01:00 Marius Soutier mps@gmail.com: The

Re: writing to hdfs on master node much faster

2015-04-20 Thread Sean Owen
What machines are HDFS data nodes -- just your master? that would explain it. Otherwise, is it actually the write that's slow or is something else you're doing much faster on the master for other reasons maybe? like you're actually shipping data via the master first in some local computation? so

When the old data dropped from the cache?

2015-04-20 Thread Tash Chainar
Hi all, On https://spark.apache.org/docs/latest/programming-guide.html under the RDD Persistence Removing Data, it states Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion. Can it be understood that the cache will

Re: writing to hdfs on master node much faster

2015-04-20 Thread Tamas Jambor
Not sure what would slow it down as the repartition completes equally fast on all nodes, implying that the data is available on all, then there are a few computation steps none of them local on the master. On Mon, Apr 20, 2015 at 12:57 PM, Sean Owen so...@cloudera.com wrote: What machines are

RE: writing to hdfs on master node much faster

2015-04-20 Thread Evo Eftimov
Check whether your partitioning results in balanced partitions ie partitions with similar sizes - one of the reasons for the performance differences observed by you may be that after your explicit repartitioning, the partition on your master node is much smaller than the RDD partitions on the

Spark 1.3.1 - SQL Issues

2015-04-20 Thread ayan guha
Hi Just upgraded to Spark 1.3.1. I am getting an warning Warning (from warnings module): File D:\spark\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\spark-1.3.1-bin-hadoop2.6\python\pyspark\sql\context.py, line 191 warnings.warn(inferSchema is deprecated, please use createDataFrame

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
I think the memory requested by your job, 2.0 GB, is higher than what is available. Please request 256 MB explicitly while creating the Spark Context and try again. Thanks and Regards, Suraj Sheth On Mon, Apr 20, 2015 at 2:44 PM, madhvi madhvi.gu...@orkash.com wrote: PFA screenshot of my

Re: Running spark over HDFS

2015-04-20 Thread Archit Thakur
There are a lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before. Search for those, please try them, and let us know if you still face problems. Thanks and Regards, Archit Thakur. On Mon, Apr 20, 2015 at 3:05 PM, madhvi

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Archit Thakur
The same difference as between map and foreach: map takes an iterator and returns an iterator; foreach takes an iterator and returns Unit. On Mon, Apr 20, 2015 at 4:05 PM, Arun Patel arunp.bigd...@gmail.com wrote: What is difference between mapPartitions vs foreachPartition? When to use these? Thanks, Arun

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 03:18 PM, Archit Thakur wrote: There are lot of similar problems shared and resolved by users on this same portal. I have been part of those discussions before, Search those, Please Try them and let us know, if you still face problems. Thanks and Regards, Archit

RE: SparkStreaming onStart not being invoked on CustomReceiver attached to master with multiple workers

2015-04-20 Thread Ankit Patel
The code I've written is simple as it just invokes a thread and calls a store method on the Receiver class. I see this code with printlns working fine when I try spark-submit --jars jar --class test.TestCustomReceiver jar However it does not work when I try the same command above with --master

Re: history-server does't read logs which are on FS

2015-04-20 Thread Serega Sheypak
Thanks, it helped. We can't use Spark 1.3 because Cassandra DSE doesn't support it. 2015-04-17 21:48 GMT+02:00 Imran Rashid iras...@cloudera.com: are you calling sc.stop() at the end of your applications? The history server only displays completed applications, but if you don't call

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Archit Thakur
True. On Mon, Apr 20, 2015 at 4:14 PM, Arun Patel arunp.bigd...@gmail.com wrote: mapPartitions is a transformation and foreachPartition is a an action? Thanks Arun On Mon, Apr 20, 2015 at 4:38 AM, Archit Thakur archit279tha...@gmail.com wrote: The same, which is between map and foreach.

SparkSQL performance

2015-04-20 Thread Renato Marroquín Mogrovejo
Hi all, I have a simple query Select * from tableX where attribute1 between 0 and 5 that I run over a Kryo file with four partitions that ends up being around 3.5 million rows in our case. If I run this query by doing a simple map().filter() it takes around ~9.6 seconds but when I apply schema,

[pyspark] Starting workers in a virtualenv

2015-04-20 Thread Karlson
Hi all, I am running the Python process that communicates with Spark in a virtualenv. Is there any way I can make sure that the Python processes of the workers are also started in a virtualenv? Currently I am getting ImportErrors when the worker tries to unpickle stuff that is not installed

Re: Running spark over HDFS

2015-04-20 Thread Akhil Das
Are you seeing your task being submitted to the UI? Under completed or running tasks? How much resources are you allocating for your job? Can you share a screenshot of your cluster UI and the code snippet that you are trying to run? Thanks Best Regards On Mon, Apr 20, 2015 at 12:37 PM, madhvi

Re: NEWBIE/not able to connect to postgresql using jdbc

2015-04-20 Thread Akhil Das
Try doing sc.addJar("path\to\your\postgres\jar"). Thanks Best Regards On Mon, Apr 20, 2015 at 12:26 PM, shashanksoni shashankso...@gmail.com wrote: I am using spark 1.3 standalone cluster on my local windows and trying to load data from one of our server. Below is my code - import os

writing to hdfs on master node much faster

2015-04-20 Thread jamborta
Hi all, I have a three node cluster with identical hardware. I am trying a workflow where it reads data from hdfs, repartitions it and runs a few map operations then writes the results back to hdfs. It looks like that all the computation, including the repartitioning and the maps complete within

Re: Running spark over HDFS

2015-04-20 Thread madhvi
On Monday 20 April 2015 02:52 PM, SURAJ SHETH wrote: Hi Madhvi, I think the memory requested by your job, i.e. 2.0 GB is higher than what is available. Please request for 256 MB explicitly while creating Spark Context and try again. Thanks and Regards, Suraj Sheth Tried the same but still

Re: Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Conor Fennell
I am looking for that build too. -Conor On Mon, Apr 20, 2015 at 9:18 AM, Marius Soutier mps@gmail.com wrote: Same problem here... On 20.04.2015, at 09:59, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi all, it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on

Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Zsolt Tóth
Hi all, it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on the mirror sites. Am I missing something? Regards, Zsolt

RE: compliation error

2015-04-20 Thread Brahma Reddy Battula
Any pointers to this issue..? Thanks Regards Brahma Reddy Battula From: Brahma Reddy Battula [brahmareddy.batt...@huawei.com] Sent: Monday, April 20, 2015 9:30 AM To: Sean Owen; Ted Yu Cc: user@spark.apache.org Subject: RE: compliation error Thanks

Re: Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Marius Soutier
Same problem here... On 20.04.2015, at 09:59, Zsolt Tóth toth.zsolt@gmail.com wrote: Hi all, it looks like the 1.2.2 pre-built version for hadoop2.4 is not available on the mirror sites. Am I missing something? Regards, Zsolt

Re: sparksql - HiveConf not found during task deserialization

2015-04-20 Thread Akhil Das
Can you try sc.addJar("/path/to/your/hive/jar")? I think it will resolve it. Thanks Best Regards On Mon, Apr 20, 2015 at 12:26 PM, Manku Timma manku.tim...@gmail.com wrote: Akhil, But the first case of creating HiveConf on the executor works fine (map case). Only the second case fails. I was

Re: Running spark over HDFS

2015-04-20 Thread madhvi
Hi, I Did the same you told but now it is giving the following error: ERROR TaskSchedulerImpl: Exiting due to error from cluster scheduler: All masters are unresponsive! Giving up. On UI it is showing that master is working Thanks Madhvi On Monday 20 April 2015 12:28 PM, Akhil Das wrote: In

RE: shuffle.FetchFailedException in spark on YARN job

2015-04-20 Thread Shao, Saisai
I don't think this problem is related to Netty or NIO; switching to nio will not change this part of the code path to get the index file for the sort-based shuffle reader. I think you could check your system from a few aspects: 1. Is there any hardware problem like disk full or others which makes this

Re: Addition of new Metrics for killed executors.

2015-04-20 Thread Archit Thakur
Hi Twinkle, We have a use case where we want to debug how and why an executor got killed. It could be because of a stack overflow, GC or any other unexpected scenario. In the driver UI there is no information present around killed executors, so I was just curious how do people

Re: Streaming problems running 24x7

2015-04-20 Thread Marius Soutier
The processing speed displayed in the UI doesn't seem to take everything into account. I also had a low processing time but had to increase batch duration from 30 seconds to 1 minute because waiting batches kept increasing. Now it runs fine. On 17.04.2015, at 13:30, González Salgado, Miquel

Re: Addition of new Metrics for killed executors.

2015-04-20 Thread twinkle sachdeva
Hi Archit, What is your use case and what kind of metrics are you planning to add? Thanks, Twinkle On Fri, Apr 17, 2015 at 4:07 PM, Archit Thakur archit279tha...@gmail.com wrote: Hi, We are planning to add new Metrics in Spark for the executors that got killed during the execution. Was

Re: Running spark over HDFS

2015-04-20 Thread madhvi
No, I am not getting the task I am running on the UI. Also I have set instances=1, but on the UI it is showing 2 workers. I am running the Java word count code exactly, but I have the text file in HDFS. Following is the part of my code I am writing to make the connection: SparkConf sparkConf = new

mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
What is difference between mapPartitions vs foreachPartition? When to use these? Thanks, Arun

Re: Running spark over HDFS

2015-04-20 Thread SURAJ SHETH
Hi Madhvi, I think the memory requested by your job, i.e. 2.0 GB is higher than what is available. Please request for 256 MB explicitly while creating Spark Context and try again. Thanks and Regards, Suraj Sheth

Order of execution of tasks inside of a stage and computing the number of stages

2015-04-20 Thread Spico Florin
Hello! I'm a newbie in Spark and I would like to understand some basic mechanisms of how it works behind the scenes. I have attached the lineage of my RDD and I have the following questions: 1. Why do I have 8 stages instead of 5? From the book Learning Spark (Chapter 8 - http://bit.ly/1E0Hah7), I

Re: mapPartitions vs foreachPartition

2015-04-20 Thread Arun Patel
mapPartitions is a transformation and foreachPartition is a an action? Thanks Arun On Mon, Apr 20, 2015 at 4:38 AM, Archit Thakur archit279tha...@gmail.com wrote: The same, which is between map and foreach. map takes iterator returns iterator foreach takes iterator returns Unit. On Mon, Apr

Custom Partitioning Spark

2015-04-20 Thread mas
Hi, I aim to do custom partitioning on a text file. I first convert it into a pair RDD and then try to use my custom partitioner. However, somehow it is not working. My code snippet is given below. val file = sc.textFile(filePath) val locLines = file.map(line => line.split("\t")).map(line =>

Understanding the build params for spark with sbt.

2015-04-20 Thread Shiyao Ma
Hi. My usage is only about the spark core and hdfs, so no spark sql or mlib or other components invovled. I saw the hint on the http://spark.apache.org/docs/latest/building-spark.html, with a sample like: build/sbt -Pyarn -Phadoop-2.3 assembly. (what's the -P for?) Fundamentally, I'd like to

Issue of running partitioned loading (RDD) in Spark External Datasource on Mesos

2015-04-20 Thread Yang Lei
I implemented two kinds of DataSource, one load data during buildScan, the other returning my RDD class with partition information for future loading. My RDD's compute gets actorSystem from SparkEnv.get.actorSystem, then use Spray to interact with a HTTP endpoint, which is the same flow as

Re: Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
I also see that it's creating both receivers on the same executor, and that might be the cause of having more RDDs on one executor than the other. Can I suggest Spark to create each receiver on a separate executor? Regards, Laeeq On Monday, April 20, 2015 4:51 PM, Evo Eftimov evo.efti...@isecc.com

Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Jeetendra Gangele
Hi All, I am querying HBase and combining the result and using it in my Spark job. I am querying HBase using the HBase client API inside my Spark job. Can anybody tell me whether Spark SQL will be fast enough and provide range queries? Regards Jeetendra

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
What is meant by "streams" here: 1. Two different DStream Receivers producing two different DStreams consuming from two different Kafka topics, each with a different message rate 2. One Kafka topic (hence only one message rate to consider) but with two different DStream receivers

Fail to read files from s3 after upgrading to 1.3

2015-04-20 Thread Ophir Cohen
Hi, Today I upgraded our code and cluster to 1.3. We are using Spark 1.3 in Amazon EMR, ami 3.6, including the history server and Ganglia. I also migrated all deprecated SchemaRDDs to DataFrames. Now when I try to read parquet files from S3 I get the below exception. Actually it is not a problem if

Re: Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
They both have the same message rate, 300 records/sec. On Monday, April 20, 2015 4:51 PM, Evo Eftimov evo.efti...@isecc.com wrote:

Re: Fail to read files from s3 after upgrading to 1.3

2015-04-20 Thread Ophir Cohen
Interesting: Remove the history server, '-a' option and using ami 3.5 fixed the problem. Now the question is: what made the change?... I vote for the '-a' but let me update... On Mon, Apr 20, 2015 at 5:43 PM, Ophir Cohen oph...@gmail.com wrote: Hi, Today I upgraded our code and cluster to 1.3.

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
Rename your log4j_special.properties file as log4j.properties and place it under the root of your jar file, and you should be fine. If you are using Maven to build your jar, place the log4j.properties in the src/main/resources folder. However, please note that if you have other dependency jar

Configuring logging properties for executor

2015-04-20 Thread Michael Ryabtsev
Hi all, I need to configure the Spark executor log4j.properties on a standalone cluster. It looks like placing the relevant properties file in the Spark configuration folder and setting spark.executor.extraJavaOptions from my application code: sparkConf.set("spark.executor.extraJavaOptions",

Re: Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
Hi, I have two different topics and two Kafka receivers, one for each topic. Regards, Laeeq On Monday, April 20, 2015 4:28 PM, Evo Eftimov evo.efti...@isecc.com wrote:

Equal number of RDD Blocks

2015-04-20 Thread Laeeq Ahmed
Hi, I have two streams of data from Kafka. How can I make an approximately equal number of RDD blocks on the two executors? Please see the attachment: one worker has 1785 RDD blocks and the other has 26. Regards, Laeeq - To unsubscribe,

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
And what is the message rate of each topic mate - that was the other part of the required clarifications From: Laeeq Ahmed [mailto:laeeqsp...@yahoo.com] Sent: Monday, April 20, 2015 3:38 PM To: Evo Eftimov; user@spark.apache.org Subject: Re: Equal number of RDD Blocks Hi, I have two

RE: Equal number of RDD Blocks

2015-04-20 Thread Evo Eftimov
Well, Spark streaming is supposed to create / distribute the receivers on different cluster nodes. If you are saying that your receivers are actually running on the same node, probably that node is getting most of the data to minimize the network transfer costs. If you want to distribute your

Shuffle files not cleaned up (Spark 1.2.1)

2015-04-20 Thread N B
Hi all, I had posed this query as part of a different thread but did not get a response there. So creating a new thread hoping to catch someone's attention. We are experiencing this issue of shuffle files being left behind and not being cleaned up by Spark. Since this is a Spark streaming

Task not Serializable: Graph is unexpectedly null when DStream is being serialized

2015-04-20 Thread Jean-Pascal Billaud
Hi, I am getting this serialization exception and I am not too sure what Graph is unexpectedly null when DStream is being serialized means? 15/04/20 06:12:38 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Task not serializable) Exception

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-20 Thread Jeetendra Gangele
Write a cron job for this like below:
12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+
32 * * * * find /tmp -type d -cmin +1440 -name spark-*-*-* -prune -exec rm -rf {} \+
52 * * * * find $SPARK_LOCAL_DIR -mindepth 1 -maxdepth 1 -type d -cmin +1440 -name spark-*-*-*

Re: Did anybody run Spark-perf on powerpc?

2015-04-20 Thread zapstar
This appears to be a problem with SSL. I'm facing the same issue.. Did you get around this somehow? I'm running IBM Java 8, on linux ppc64le. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Did-anybody-run-Spark-perf-on-powerpc-tp22329p22575.html Sent

Re: Configuring logging properties for executor

2015-04-20 Thread Michael Ryabtsev
Hi Lan, Thanks for the fast response. It could be a solution if it works. I have more than one log4j properties file, for different run modes like debug/production, for the executor and for the application core. I think I would like to keep them separate. Then, I suppose I should give all other properties

Re: Configuring logging properties for executor

2015-04-20 Thread Lan Jiang
Each application gets its own executor processes, so there should be no problem running them in parallel. Lan On Apr 20, 2015, at 10:25 AM, Michael Ryabtsev michael...@gmail.com wrote: Hi Lan, Thanks for fast response. It could be a solution if it works. I have more than one log4

How can i custom the external initialize when start the spark cluster

2015-04-20 Thread ??????????
Hi All, I have a question about the Spark thriftserver. I want to load some tables when the Spark server starts. How can I configure external initialization in Spark? I guess Spark should have an interface that can be configured in spark-defaults.conf, and we can implement the initialization function

Re: Did anybody run Spark-perf on powerpc?

2015-04-20 Thread zapstar
This appears to be a problem with SSL. I'm facing the same issue.. Did you get around this somehow? I'm running IBM Java 8, on linux ppc64le. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Did-anybody-run-Spark-perf-on-powerpc-tp22329p22576.html Sent

Re: Fail to read files from s3 after upgrading to 1.3

2015-04-20 Thread Ophir Cohen
And the winner is: ami 3.6. Apparently it does not work with it... ami 3.5 works great. Interesting: Remove the history server, '-a' option and using ami 3.5 fixed the problem. Now the question is: what made the change?... I vote for the '-a' but let me update... On Mon, Apr 20, 2015 at 5:43 PM,

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Jeetendra Gangele
Thanks for the reply. Will using Phoenix inside Spark be useful? What is the best way to bring data from HBase into Spark in terms of application performance? Regards Jeetendra On 20 April 2015 at 20:49, Ted Yu yuzhih...@gmail.com wrote: To my knowledge, Spark SQL currently doesn't provide

Re: Spark SQL vs map reduce tableInputOutput

2015-04-20 Thread Ted Yu
To my knowledge, Spark SQL currently doesn't provide range scan capability against hbase. Cheers On Apr 20, 2015, at 7:54 AM, Jeetendra Gangele gangele...@gmail.com wrote: HI All, I am Querying Hbase and combining result and using in my spake job. I am querying hbase using Hbase

Unsupported types in org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType

2015-04-20 Thread ARose
So I am trying to pull data from an external database using JDBC: Map<String, String> options = new HashMap(); options.put("driver", driver); options.put("url", dburl); options.put("dbtable", tmpTrunk); DataFrame tbTrunkInfo = sqlContext.load("jdbc", options); And

RE: Super slow caching in 1.3?

2015-04-20 Thread Evo Eftimov
Now this is very important: "Normal RDDs" refers to "batch RDDs". However, the default in-memory serialization of RDDs which are part of a DStream is "Serialized" rather than actual (hydrated) objects. The Spark documentation states that "Serialization" is required for space and garbage

Custom paritioning of DSTream

2015-04-20 Thread Evo Eftimov
Is the only way to implement custom partitioning of a DStream via the foreach approach, so as to gain access to the actual RDDs comprising the DStream and hence their partitionBy method? DStream has only a repartition method, accepting only the number of partitions, BUT not the method of partitioning
