Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi Zhan, Thank you for providing a workaround! I will try this out, but I agree with Ted: there should be a better way to capture the exception and handle it by just initializing SQLContext instead of HiveContext, and WARN the user that something is wrong with his Hive setup. Having
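As a rough illustration of the fallback being proposed here (a sketch only, not Spark's actual shell-startup code; it assumes the Hive classes are on the classpath):

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Try HiveContext first; if its initialization fails (e.g. a metastore lock held by
    // another spark-shell), warn and fall back to a plain SQLContext.
    def createSqlContext(sc: SparkContext): SQLContext =
      try {
        new org.apache.spark.sql.hive.HiveContext(sc)
      } catch {
        case e: Exception =>
          println(s"WARN: HiveContext initialization failed (${e.getMessage}); " +
            "falling back to SQLContext. Check your Hive setup.")
          new SQLContext(sc)
      }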

What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha
Hi, What is the efficient way to join two RDDs? Would converting both the RDDs to IndexedRDDs be of any help? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-is-the-efficient-way-to-Join-two-RDDs-tp25310.html Sent from the Apache Spark

Re: What is the efficient way to Join two RDDs?

2015-11-06 Thread swetha kasireddy
I think they are roughly of equal size. On Fri, Nov 6, 2015 at 3:45 PM, Ted Yu wrote: > Can you tell us a bit more about your use case ? > > Are the two RDDs expected to be of roughly equal size or, to be of vastly > different sizes ? > > Thanks > > On Fri, Nov 6, 2015 at
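For reference, a minimal Scala sketch of a keyed join between two RDDs of comparable size, assuming the spark-shell's existing SparkContext `sc` (data and partition count are illustrative). Pre-partitioning both sides with the same HashPartitioner avoids re-shuffling them on each join:

    import org.apache.spark.HashPartitioner

    val left  = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val right = sc.parallelize(Seq((1, 10), (2, 20), (4, 40)))

    // Co-partition both RDDs so the join itself does not shuffle them again.
    val partitioner = new HashPartitioner(8)
    val l = left.partitionBy(partitioner).cache()
    val r = right.partitionBy(partitioner).cache()

    val joined = l.join(r)   // RDD[(Int, (String, Int))]
    joined.collect().foreach(println)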

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
Hi Jerry, OK. Here is an ugly workaround. Put a hive-site.xml under $SPARK_HOME/conf with invalid content. You will get a bunch of exceptions because HiveContext initialization fails, but you can initialize your SQLContext on your own. scala> val sqlContext = new
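The shell command above is cut off in this archive; a minimal sketch of initializing a plain SQLContext in spark-shell (using the shell's existing `sc`; the sample data is illustrative):

    scala> import org.apache.spark.sql.SQLContext
    scala> val sqlContext = new SQLContext(sc)   // no Hive metastore involved
    scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    scala> df.show()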

Re: kerberos question

2015-11-06 Thread Ruslan Dautkhanov
Instead of specifying the --principal [principal] --keytab [keytab] --proxy-user [proxied_user] ... arguments, you could probably just create/renew a Kerberos ticket before submitting the job: $ kinit principal.name -kt keytab.file $ spark-submit ... Do you need impersonation / proxy user at all? I

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2015-11-06 Thread Kayode Odeyemi
Thank you. That seems to resolve it. On Fri, Nov 6, 2015 at 11:46 PM, Ted Yu wrote: > You mentioned resourcemanager but not nodemanagers. > > I think you need to install Spark on nodes running nodemanagers. > > Cheers > > On Fri, Nov 6, 2015 at 1:32 PM, Kayode Odeyemi

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
Hi all, Could anyone help me with this? It's a bit urgent for me. I'm very confused and curious about Spark data checkpoint performance. Is there any detailed implementation of checkpointing I can look into? Spark Streaming only takes sub-second to process 20K messages/sec, however it takes 25

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
I agree with minor change. Adding a config to provide the option to init SQLContext or HiveContext, with HiveContext as default instead of bypassing when hitting the Exception. Thanks. Zhan Zhang On Nov 6, 2015, at 2:53 PM, Ted Yu > wrote: I

spark ec2 script does not install necessary files to launch spark

2015-11-06 Thread Emaasit
Hello, I followed the instructions for launching Spark 1.5.1 on my AWS EC2 but the script is not installing all the folders/files required to initialize Spark. Since the log message is long, I have created a gist here: https://gist.github.com/Emaasit/696145959bbbd989bfe1 Please help. I have been

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Ted Yu
I would suggest adding a config parameter that allows bypassing initialization of HiveContext in case of SQLException Cheers On Fri, Nov 6, 2015 at 2:50 PM, Zhan Zhang wrote: > Hi Jerry, > > OK. Here is an ugly walk around. > > Put a hive-site.xml under $SPARK_HOME/conf

Re: What is the efficient way to Join two RDDs?

2015-11-06 Thread Ted Yu
Can you tell us a bit more about your use case ? Are the two RDDs expected to be of roughly equal size or, to be of vastly different sizes ? Thanks On Fri, Nov 6, 2015 at 3:21 PM, swetha wrote: > Hi, > > What is the efficient way to join two RDDs? Would converting

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
If your assembly jar has the Hive jar included, HiveContext will be used. Typically, HiveContext has more functionality than SQLContext. In what case do you have to use SQLContext for something that cannot be done by HiveContext? Thanks. Zhan Zhang On Nov 6, 2015, at 10:43 AM, Jerry Lam

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi Zhan, I don’t use HiveContext features at all. I mostly use the DataFrame API. It is sexier and much less typo-prone. :) Also, HiveContext requires a metastore database setup (Derby by default). The problem is that I cannot have 2 spark-shell sessions running at the same time on the same host (e.g.

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Zhan Zhang
Hi Jerry, https://issues.apache.org/jira/browse/SPARK-11562 is created for the issue. Thanks. Zhan Zhang On Nov 6, 2015, at 3:01 PM, Jerry Lam > wrote: Hi Zhan, Thank you for providing a workaround! I will try this out but I agree with Ted,

Re: spark ec2 script does not install necessary files to launch spark

2015-11-06 Thread Alexander Pivovarov
Try EMR-4.1.0; it has Spark 1.5.0 running on YARN. Replace subnet-xxx with the correct one: $ aws emr create-cluster --name emr41_3 --release-label emr-4.1.0 --instance-groups InstanceCount=1,Name=sparkMaster,InstanceGroupType=MASTER,InstanceType=r3.2xlarge

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
Can you try storing the state (word count) in an external key value store? On Sat, Nov 7, 2015, 8:40 AM Thúy Hằng Lê wrote: > Hi all, > > Anyone could help me on this. It's a bit urgent for me on this. > I'm very confused and curious about Spark data checkpoint

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
It depends on the stats you are collecting. For example, if you are just collecting counts, you can do away with updateStateByKey completely by doing an insert-or-update operation on the data store after the reduce. I.e. for each (key, batchCount): if (key exists in dataStore) update count = count +
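A rough sketch of that pattern, assuming a DStream of per-batch counts and a hypothetical key-value client (`KVStore` here is a made-up stand-in, not a real library):

    // wordCounts: DStream[(String, Long)] produced by reduceByKey on each batch.
    wordCounts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val store = KVStore.connect()            // hypothetical per-partition connection
        partition.foreach { case (key, batchCount) =>
          val current = store.get(key).getOrElse(0L)
          store.put(key, current + batchCount)   // upsert: running count += batch count
        }
        store.close()
      }
    }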

Re: Fwd: Re: DataFrame equality does not working in 1.5.1

2015-11-06 Thread Seongduk Cheon
Hi, Michael. It works fine. scala> sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.partitionPruning=false") res28: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> eventDF.filter($"entityType" === "user").select("entityId").distinct.count res29: Long = 2091 Thank you

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Thúy Hằng Lê
Thanks Aniket, I want to store the state in external storage, but that should come in a later step, I think. Basically, I have to use the updateStateByKey function to maintain the running state (which requires checkpointing), and my bottleneck is now the data checkpoint. My pseudo code is like below:
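(The poster's pseudo code is cut off in this archive excerpt. For readers following along, a generic updateStateByKey sketch, not the original code, looks roughly like this; the source, batch interval and checkpoint path are illustrative, and `sc` is assumed to exist.)

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    ssc.checkpoint("/tmp/streaming-checkpoint")        // required by updateStateByKey

    val pairs = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map((_, 1L))

    // Running count per key, kept in Spark's internal state (hence the checkpointing cost).
    val runningCounts = pairs.updateStateByKey[Long] { (newValues: Seq[Long], state: Option[Long]) =>
      Some(state.getOrElse(0L) + newValues.sum)
    }
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()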

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
I'd expect writing Parquet files to be slower than writing JSON files since Parquet involves more complicated encoders, but maybe not that slow. Would you mind trying to profile one Spark executor using tools like YJP to see what the hotspot is? Cheng On 11/6/15 7:34 AM, rok wrote: Apologies if

Re: Issue of Hive parquet partitioned table schema mismatch

2015-11-06 Thread Rex Xiong
spark.sql.hive.convertMetastoreParquet is true. I can't repro the issue of scanning all partitions now.. :P Anyway, I found another email thread, "Re: Spark Sql behaves strangely with tables with a lot of partitions". I observe the same issue as Jerrick: the Spark driver will call listStatus for the

ResultStage's parent stages only ShuffleMapStages?

2015-11-06 Thread Jacek Laskowski
Hi, Just to make sure that what I see in the code and think I understand is indeed correct... When a job is submitted to DAGScheduler, it creates a new ResultStage that in turn queries for the parent stages of itself given the RDD (using `getParentStagesAndId` in `newResultStage`). Are a

[Streaming] Long time to catch up when streaming application restarts from checkpoint

2015-11-06 Thread Terry Hoo
All, I have a streaming application that monitors an HDFS folder and computes some metrics based on this data; the data in this folder will be updated by another upload application. The streaming application's batch interval is 1 minute, batch processing time of streaming is about 30 seconds,

Re: ResultStage's parent stages only ShuffleMapStages?

2015-11-06 Thread Jeff Zhang
Right, there are only 2 kinds of stage: ResultStage & ShuffleMapStage. ShuffleMapStage will shuffle its data for downstream consumption, but ResultStage doesn't need to do that. I guess you may be confusing these concepts with Map/Reduce. Actually, ShuffleMapStage could be represented as either Map

cartesian in the loop, runtime grows

2015-11-06 Thread efa
Hi All, I have a problem with the cartesian product. I build the cartesian of two RDDs in a loop and the result is squeezed back to the original size of one of the participating variables. At the end of each iteration this result is assigned to the original variable. I expect the same running time for each iteration,

Checkpointing an InputDStream from Kafka

2015-11-06 Thread Kathi Stutz
Hi all, I want to load an InputDStream from a checkpoint, but it doesn't work, and after trying several things I have finally run out of ideas. So, here's what I do: 1. I create the streaming context - or load it from the checkpoint directory. def main(args: Array[String]) { val ssc =
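For comparison, a minimal checkpoint-recovery skeleton for a direct Kafka stream (broker, topic and checkpoint path are illustrative); the key point is that all DStream setup happens inside the creating function passed to StreamingContext.getOrCreate:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "/tmp/kafka-checkpoint"        // illustrative path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-checkpoint-sketch")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("mytopic"))
      stream.map(_._2).print()                         // the "do stuff with the dstream" part goes here
      ssc
    }

    // Recover from the checkpoint if it exists, otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()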

Re: Get complete row with latest timestamp after a groupBy?

2015-11-06 Thread bghit
You are trying to get the top-k most recent records for each user (k=1 in your case). You should avoid using groupBy because it's an expensive operation that will hurt performance in Spark -- check out [1] for more details. Instead, you can use the combineByKey function with a custom combiner
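For the k = 1 case, a plain reduceByKey (which, like combineByKey, combines map-side before the shuffle) is enough; a sketch with illustrative data, assuming an existing SparkContext `sc`:

    // Records keyed by user id, with value (timestamp, full row payload).
    val records = sc.parallelize(Seq(
      ("alice", (1000L, "row-a1")),
      ("alice", (2000L, "row-a2")),
      ("bob",   (1500L, "row-b1"))))

    // Keep only the record with the largest timestamp per key, without grouping all values.
    val latestPerUser = records.reduceByKey((a, b) => if (a._1 >= b._1) a else b)
    latestPerUser.collect().foreach(println)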

Re: Dynamic Allocation & Spark Streaming

2015-11-06 Thread Adrian Tanase
You can register a streaming listener – in the BatchInfo you’ll find a lot of stats (including count of received records) that you can base your logic on: https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/StreamingListener.scala
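A minimal listener sketch along those lines (assuming the 1.5-era StreamingListener API; register it with ssc.addStreamingListener before starting the context):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs the record count and processing delay of every completed batch; a scaling
    // decision could be driven from these numbers instead of just printing them.
    class BatchVolumeListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"Batch ${info.batchTime}: ${info.numRecords} records received, " +
          s"processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
      }
    }

    // ssc.addStreamingListener(new BatchVolumeListener())   // before ssc.start()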

Re: very slow parquet file write

2015-11-06 Thread Jörn Franke
Do you use some compression? Maybe there is some activated by default in your Hadoop environment? > On 06 Nov 2015, at 00:34, rok wrote: > > Apologies if this appears a second time! > > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a >

[Spark R] could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

2015-11-06 Thread Todd
I am launching SparkR with the following script: ./sparkR --driver-memory 12G and I try to load a local 3G csv file with the following code: > a=read.transactions("/home/admin/datamining/data.csv",sep="\t",format="single",cols=c(1,2)) but I encounter an error: could not allocate memory (2048 Mb) in

Re: Unable to register UDF with StructType

2015-11-06 Thread Rishabh Bhardwaj
Thanks for your response. Actually I want to return a Row or (return struct). But, attempting to register UDF returning Row throws the following error, scala> def test(r:Row):Row = r test: (r: org.apache.spark.sql.Row)org.apache.spark.sql.Row scala> sqlContext.udf.register("test",test _)

Re: How to unpersist a DStream in Spark Streaming

2015-11-06 Thread Adrian Tanase
Do we have any guarantees on the maximum duration? I've seen RDDs kept around for 7-10 minutes on batches of 20 secs and checkpoint of 100 secs. No windows, just updateStateByKey. It's not a memory issue, but on checkpoint recovery it goes back to Kafka for 10 minutes of data, any idea why?

Re: Dynamic Allocation & Spark Streaming

2015-11-06 Thread Kyle Lin
Hey there, I run Spark Streaming 1.5.1 on YARN with dynamic allocation, and use the direct stream API to read data from Kafka. The Spark job can dynamically request an executor when reaching spark.dynamicAllocation.schedulerBacklogTimeout. However, it won't dynamically remove an executor when there is no

Spark Streaming : minimum cores for a Receiver

2015-11-06 Thread mpals
As per the documentation : http://spark.apache.org/docs/latest/streaming-programming-guide.html#input-dstreams-and-receivers "if you want to receive multiple streams of data in parallel in your streaming application, you can create multiple input DStreams (discussed further in the Performance
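A short sketch of that pattern (sources and ports are illustrative; an existing SparkContext `sc` is assumed); note that each receiver pins one core, so the application needs at least numStreams + 1 cores to also process the data:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(5))

    val numStreams = 3
    val streams = (1 to numStreams).map(i => ssc.socketTextStream("localhost", 9000 + i))

    // Union the parallel input DStreams into one stream for downstream processing.
    val unified = ssc.union(streams)
    unified.count().print()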

Re: Get complete row with latest timestamp after a groupBy?

2015-11-06 Thread bghit
I asked the same question a few days ago, but I did not receive any answer. You may want to look into UDAFs for that. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Get-complete-row-with-latest-timestamp-after-a-groupBy-tp25304p25308.html Sent from the

Re: Fwd: Re: DataFrame equality does not working in 1.5.1

2015-11-06 Thread Seongduk Cheon
Hi Yanal! Yes, exactly. I read from a csv file and convert it to a DF with column names. It simply looks like this: val eventDF = sc.textFile(eventFile).map(_.split(",")).filter(_.size >= 6) .map { e => // do something }.toDF(eventTableColumns:_*).cache() The result of the <=> function is

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi Ted, I was trying to set spark.sql.dialect to sql so as to specify that I only need “SQLContext”, not HiveContext. It didn’t work; it still instantiates HiveContext. Since I don’t use HiveContext and I don’t want to start a mysql database because I want to have more than 1 session of spark-shell

spark 1.5.0 mllib lda eats up all the disk space

2015-11-06 Thread TheGeorge1918 .
Hi all, *PROBLEM:* I'm using spark 1.5.0 distributedLDA to do topic modelling. It looks like after 20 iterations, the whole disk space is exhausted and the application broke down. *DETAILS:* I'm using 4 m3.2xlarge (each has 30G memory and 2x80G disk space) machines as data nodes. I monitored

Re: Checkpointing an InputDStream from Kafka

2015-11-06 Thread Cody Koeninger
Have you looked at the driver and executor logs? Without being able to see what's in the "do stuff with the dstream" section of code... I'd suggest starting with a simpler job, e.g. one that does nothing but print each message, and verify whether it checkpoints. On Fri, Nov 6, 2015 at 3:59 AM, Kathi

Re: very slow parquet file write

2015-11-06 Thread Rok Roskar
yes I was expecting that too because of all the metadata generation and compression. But I have not seen performance this bad for other parquet files I’ve written and was wondering if there could be something obvious (and wrong) to do with how I’ve specified the schema etc. It’s a very simple

Re: very slow parquet file write

2015-11-06 Thread Cheng Lian
On 11/6/15 10:53 PM, Rok Roskar wrote: yes I was expecting that too because of all the metadata generation and compression. But I have not seen performance this bad for other parquet files I’ve written and was wondering if there could be something obvious (and wrong) to do with how I’ve

[sparkR] Any insight on java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-06 Thread Dhaval Patel
I have been struggling with this error for the past 3 days and have tried all possible ways/suggestions people have provided on Stack Overflow and here in this group. I am trying to read a parquet file using SparkR and convert it into an R dataframe for further usage. The file size is not that

Re: Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-06 Thread Richard Hillegas
Hi Rajesh, The 1.6 schedule is available on the front page of the Spark wiki: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage. I don't know of any workarounds for this problem. Thanks, Rick Madabhattula Rajesh Kumar wrote on 11/05/2015 06:35:22 PM: >

Re: Looking for the method executors uses to write to HDFS

2015-11-06 Thread Reynold Xin
Are you looking for this? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L69 On Wed, Nov 4, 2015 at 5:11 AM, Tóth Zoltán wrote: > Hi, > > I'd like to write a parquet file from the

Is there any way to do partition discovery without 'field=' in folder names?

2015-11-06 Thread Wei Chen
Hey Friends, I've been using partition discovery with folder structures that have "field=" in the folder names. However, I've also encountered a lot of folder structures without "field=" in the folder names, especially for year, month, day. Is there any way that we can assign a field to each level
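One workaround sketch, assuming Parquet data laid out as .../yyyy/MM/dd/ (paths and column names here are made up, and `sqlContext` comes from the shell): read each leaf folder and attach the partition values as literal columns before unioning.

    import org.apache.spark.sql.functions.lit

    val days = Seq(("2015", "11", "05"), ("2015", "11", "06"))   // illustrative partitions

    val df = days.map { case (y, m, d) =>
      sqlContext.read.parquet(s"/data/events/$y/$m/$d")
        .withColumn("year", lit(y))
        .withColumn("month", lit(m))
        .withColumn("day", lit(d))
    }.reduce(_ unionAll _)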

[Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
Hi spark users and developers, Is it possible to disable HiveContext from being instantiated when using spark-shell? I get the following errors when more than one session starts. Since I don't use HiveContext, it would be great if I could have more than 1 spark-shell running at the same time.

Re: creating a distributed index

2015-11-06 Thread swetha kasireddy
Hi Ankur, I have the following questions on IndexedRDD. 1. Does IndexedRDD support String key types? As per the current documentation, it looks like it supports only Long. 2. Is IndexedRDD efficient when joined with another RDD? So, basically my use case is that I need to create an

RE: [Spark R] could not allocate memory (2048 Mb) in C function 'R_AllocStringBuffer'

2015-11-06 Thread Sun, Rui
Hi, Todd, the "--driver-memory" option specifies the maximum heap memory size of the JVM backend for SparkR. The error you faced is a memory allocation error of your R process. They are different. I guess that the 2G memory bound for a string is a limitation of the R interpreter? That's the reason why we

Spark Streaming updateStateByKey Implementation

2015-11-06 Thread Hien Luu
Hi, I am interested in learning about the implementation of updateStateByKey. Does anyone know of a JIRA or design doc I can read? I did a quick search and couldn't find much info on the implementation. Thanks in advance, Hien

Spark SQL 'explode' command failing on AWS EC2 but succeeding locally

2015-11-06 Thread Anthony Rose
Hi all, I am using Spark SQL and I have a table stored in a Dataframe that I am trying to re-structure. I have an approach that works locally but when I try to run the same command on an AWS EC2 instance I get an error reporting that I have an 'unresolved operator' Basically I have data that
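For reference, a small explode example that works the same locally and on a cluster (column names are illustrative; an existing SQLContext `sqlContext` is assumed):

    import org.apache.spark.sql.functions.explode

    val df = sqlContext.createDataFrame(Seq(
      ("a", Seq(1, 2, 3)),
      ("b", Seq(4, 5))
    )).toDF("id", "values")

    // One output row per element of the array column.
    val exploded = df.select(df("id"), explode(df("values")).as("value"))
    exploded.show()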

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
What is interesting is that the pyspark shell works fine with multiple sessions on the same host even though multiple HiveContexts have been created. What does pyspark do differently in terms of starting up the shell? > On Nov 6, 2015, at 12:12 PM, Ted Yu wrote: > > In

Re: Serializers problems maping RDDs to objects again

2015-11-06 Thread Iker Perez de Albeniz
I have seen that the problem is with the Geohash class, which cannot be pickled.. but in groupByKey I use another custom class and there is no problem... 2015-11-06 13:44 GMT+01:00 Iker Perez de Albeniz : > Hi All, > > I am new to this list. Before sending this mail i have

Serializers problems maping RDDs to objects again

2015-11-06 Thread Iker Perez de Albeniz
Hi All, I am new to this list. Before sending this mail I searched the archive but have not found a solution for my case. I am using Spark to process user locations based on RSSI. My Spark script looks like this.. text_files = sc.textFile(','.join(files[startime])) result =

Re: Fwd: Re: DataFrame equality does not working in 1.5.1

2015-11-06 Thread Michael Armbrust
In particular this is sounding like: https://issues.apache.org/jira/browse/SPARK-10859 On Fri, Nov 6, 2015 at 1:05 PM, Michael Armbrust wrote: > I would be great if you could try sql("SET > spark.sql.inMemoryColumnarStorage.partitionPruning=false") also, try Spark >

anyone using netlib-java with sparkR on yarn spark1.6?

2015-11-06 Thread Tom Graves
I'm trying to use the netlib-java stuff with MLlib and SparkR on YARN. I've compiled with -Pnetlib-lgpl and I see the necessary things in the Spark assembly jar. The nodes have /usr/lib64/liblapack.so.3, /usr/lib64/libblas.so.3, and /usr/lib/libgfortran.so.3. Running: data <- read.df(sqlContext,

Re: Unable to register UDF with StructType

2015-11-06 Thread Michael Armbrust
This isn't supported today. On Fri, Nov 6, 2015 at 1:28 AM, Rishabh Bhardwaj wrote: > Thanks for your response. > Actually I want to return a Row or (return struct). > But, attempting to register UDF returning Row throws the following error, > > scala> def test(r:Row):Row =

Re: Fwd: Re: DataFrame equality does not working in 1.5.1

2015-11-06 Thread Michael Armbrust
It would be great if you could try sql("SET spark.sql.inMemoryColumnarStorage.partitionPruning=false"). Also, try Spark 1.5.2-RC2. On Fri, Nov 6, 2015 at 4:49 AM, Seongduk Cheon wrote: > Hi Yanal! > > Yes,

Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2015-11-06 Thread Kayode Odeyemi
Hi, I have a YARN Hadoop setup of 8 nodes (7 datanodes, 1 namenode and resourcemanager). I have Spark set up only on the namenode/resourcemanager. Do I need to have Spark installed on the datanodes? I ask because I'm getting the below error when I run a Spark job through spark-submit: Error:

Re: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher

2015-11-06 Thread Ted Yu
You mentioned resourcemanager but not nodemanagers. I think you need to install Spark on nodes running nodemanagers. Cheers On Fri, Nov 6, 2015 at 1:32 PM, Kayode Odeyemi wrote: > Hi, > > I have a YARN hadoop setup of 8 nodes (7 datanodes, 1 namenode and > resourcemaneger).

bug: cannot run IPython notebook on cluster

2015-11-06 Thread Andy Davidson
Does anyone use IPython notebooks? I am able to use it on my local machine with Spark, however I cannot get it to work on my cluster. For an unknown reason, on my cluster I have to manually create the Spark context. My test code generated this exception: Exception: Python in worker has different