Re: Joining a RDD to a Dataframe

2016-05-08 Thread Cyril Scetbon
Hi Ashish, The issue is not related to converting an RDD to a DF. I did that. I was just asking if I should do it differently. The issue concerns the exception thrown when using array_contains with a sql.Column instead of a value. I found another way to do it using explode, as follows:
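Purely as an illustration (not the code from the original message, which is cut off above), an explode-based join might look like the sketch below; the names esDF, streamDF, "ids", "id" and "payload" are all made up.

    import org.apache.spark.sql.functions.explode

    // Exploding the array column turns the membership test into a plain equi-join,
    // which avoids calling array_contains with a Column argument.
    val exploded = esDF.select(explode(esDF("ids")).as("id"), esDF("payload"))
    val joined   = streamDF.join(exploded, streamDF("id") === exploded("id"))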

How big can the Spark Streaming window be?

2016-05-08 Thread kramer2...@126.com
We have some streaming data that needs to be processed, and we are considering using Spark Streaming to do it. We need to generate three kinds of reports, based on: 1. the last 5 minutes of data, 2. the last 1 hour of data, 3. the last 24 hours of data. The reports are generated every 5 minutes. After reading the
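A rough windowing sketch for this setup (the batch interval, source and host are placeholders; whether a 24-hour window is practical depends on data volume and memory, so persisting 5-minute aggregates externally and rolling them up is a common alternative):

    import org.apache.spark.streaming.{Minutes, StreamingContext}

    // Batch interval of 5 minutes; every window below slides every 5 minutes.
    val ssc = new StreamingContext(sc, Minutes(5))
    val events = ssc.socketTextStream("somehost", 9999)   // placeholder source

    val last5m  = events.window(Minutes(5),       Minutes(5))
    val last1h  = events.window(Minutes(60),      Minutes(5))
    val last24h = events.window(Minutes(60 * 24), Minutes(5))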

partitioner aware subtract

2016-05-08 Thread Raghava Mutharaju
Hello All, We have two PairRDDs (rdd1, rdd2) which are hash partitioned on key (the number of partitions is the same for both RDDs). We would like to subtract rdd2 from rdd1. The subtract code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala seems to
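One partitioner-aware approach, sketched under the assumption that both RDDs already share the same HashPartitioner so matching keys sit in matching partition indexes (note this removes by key, i.e. a subtractByKey, rather than an exact (key, value) subtract):

    // rdd1: RDD[(K, V)], rdd2: RDD[(K, W)], both partitioned by the same HashPartitioner.
    // zipPartitions pairs up co-located partitions, so no shuffle is needed.
    val subtracted = rdd1.zipPartitions(rdd2, preservesPartitioning = true) { (left, right) =>
      val keysToRemove = right.map(_._1).toSet
      left.filter { case (k, _) => !keysToRemove.contains(k) }
    }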

Re: Joining a RDD to a Dataframe

2016-05-08 Thread Ashish Dubey
Is there any reason you don't want to convert this? I don't think a join between an RDD and a DF is supported. On Sat, May 7, 2016 at 11:41 PM, Cyril Scetbon wrote: > Hi, > > I have an RDD built during a spark streaming job and I'd like to join it to > a DataFrame (E/S input) to
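A hedged sketch of that conversion-then-join approach; the Event case class and column names are made up, and inside a streaming job this would typically run in foreachRDD:

    import sqlContext.implicits._

    // Hypothetical shape of the streaming records; adjust to the real schema.
    case class Event(id: String, value: Double)

    // rdd is assumed to be an RDD[(String, Double)] produced by the streaming job.
    val streamDF = rdd.map { case (id, value) => Event(id, value) }.toDF()
    val enriched = streamDF.join(esDF, streamDF("id") === esDF("id"))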

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-08 Thread Ashish Dubey
I see the behavior: it always uses the minimum total number of tasks possible for your settings (num-executors * num-cores). However, if you use a huge amount of data then you will see more tasks, which means it has some kind of lower bound on the number of tasks. It may require some digging. Other formats did not

Re: BlockManager crashing applications

2016-05-08 Thread Ashish Dubey
Caused by: java.io.IOException: Failed to connect to ip-10-12-46-235.us-west-2.compute.internal/10.12.46.235:55681 at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216) at

Re: BlockManager crashing applications

2016-05-08 Thread Brandon White
I'm not quite sure how this is a memory problem. There are no OOM exceptions, and the job only breaks when actions are run in parallel, submitted to the scheduler by different threads. The issue is that the doGetRemote function does not retry when it is denied access to a cache block. On May 8,

Re: BlockManager crashing applications

2016-05-08 Thread Ashish Dubey
Brandon, how much memory are you giving to your executors? Did you check if there were dead executors in your application logs? Most likely you need more memory for the executors. Ashish On Sun, May 8, 2016 at 1:01 PM, Brandon White wrote: > Hello all, > > I am
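For reference, a couple of the standard Spark 1.x settings involved; the values below are only examples, not recommendations:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "8g")                  // heap per executor
      .set("spark.yarn.executor.memoryOverhead", "1024")   // off-heap overhead in MB, on YARN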

Re: Parse Json in Spark

2016-05-08 Thread Ashish Dubey
This limit is due to the underlying InputFormat implementation. You can always write your own InputFormat and then use Spark's newAPIHadoopFile API to pass your InputFormat class. You will have to place the jar file in the /lib location on all the nodes. Ashish On Sun, May 8, 2016 at 4:02 PM,
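A sketch of the newAPIHadoopFile call; TextInputFormat stands in here for a custom InputFormat, and the path is a placeholder:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // A custom InputFormat class would be passed the same way as TextInputFormat below.
    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/input",        // placeholder path
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])
    val lines = records.map(_._2.toString)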

Re: Parse Json in Spark

2016-05-08 Thread Hyukjin Kwon
I remember this JIRA: https://issues.apache.org/jira/browse/SPARK-7366. Parsing multi-line records is not supported by the JSON data source. Instead, this can be done with sc.wholeTextFiles(). I found some examples here: http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
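A small sketch of that approach, assuming each input file holds exactly one (possibly multi-line) JSON document and fits comfortably in executor memory:

    // wholeTextFiles yields (path, fullFileContent) pairs; feeding the contents to
    // read.json treats each whole file as a single JSON record.
    val raw = sc.wholeTextFiles("/path/to/json/dir")    // placeholder path
    val df  = sqlContext.read.json(raw.map(_._2))
    df.printSchema()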

Parse Json in Spark

2016-05-08 Thread KhajaAsmath Mohammed
Hi, I am working on parsing JSON in Spark, but most of the information available online states that I need to have the entire JSON in a single line. In my case, the JSON file is delivered in a complex structure and not in a single line. Does anyone know how to process this in Spark? I used the Jackson jar

BlockManager crashing applications

2016-05-08 Thread Brandon White
Hello all, I am running a Spark application which schedules multiple Spark jobs. Something like: val df = sqlContext.read.parquet("/path/to/file") filterExpressions.par.foreach { expression => df.filter(expression).count() } When the block manager fails to fetch a block, it throws an

different SqlContext with same udf name with different meaning

2016-05-08 Thread Igor Berman
Hi, suppose I have a multitenant environment and I want to give my users additional functions, but for each user/tenant the meaning of the same function depends on the user's specific configuration. Is it possible to register the same function several times under different SQLContexts? Are several
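A sketch of what that could look like, assuming each SQLContext instance keeps its own function registry (worth verifying for the Spark version in use); the tenant-specific factor is made up:

    import org.apache.spark.sql.SQLContext

    def contextFor(tenantFactor: Int): SQLContext = {
      val sqlCtx = new SQLContext(sc)
      // Same name, tenant-specific behavior, registered only in this context.
      sqlCtx.udf.register("scale", (x: Int) => x * tenantFactor)
      sqlCtx
    }

    val tenantA = contextFor(10)
    val tenantB = contextFor(100)
    // "scale" now means different things in tenantA.sql(...) and tenantB.sql(...)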

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
Yes, my mistake. I am using Spark 1.5.2, not 2.x. I looked at the running Spark driver JVM process on Linux. It looks like my settings are not being applied to the driver. We use the Oozie spark action to launch Spark. I will have to investigate more on that. Hopefully Spark has, or will have, replaced the memory killer

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Ted Yu
See the following: [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming. I guess you meant you are using Spark 1.5.1. For the time being, consider increasing spark.driver.memory. Cheers On Sun, May 8, 2016 at 9:14 AM, Nirav Patel wrote: > Yes, I am using yarn

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
Yes, I am using YARN client mode, hence I specified the AM settings too. What do you mean, Akka is moved out of the picture? I am using spark 2.5.1 Sent from my iPhone > On May 8, 2016, at 6:39 AM, Ted Yu wrote: > > Are you using YARN client mode ? > > See >

Re: Correct way of setting executor numbers and executor cores in Spark 1.6.1 for non-clustered mode ?

2016-05-08 Thread Mich Talebzadeh
Hi Karen, You mentioned: "So if I'm reading your email correctly it sounds like I should be able to increase the number of executors on local mode by adding hostnames for localhost. and cores per executor with SPARK_EXECUTOR_CORES. And by starting master/slave(s) for local host I can access
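For reference, two illustrative ways to get more parallelism on a single machine (values are examples only): plain local mode, which runs one executor with N worker threads, versus a standalone master/worker started on localhost, which gives real executors sized with the standard spark.executor.* keys:

    // 1) Plain local mode: a single executor with 8 worker threads.
    val localConf = new org.apache.spark.SparkConf()
      .setMaster("local[8]")
      .setAppName("local-test")

    // 2) Standalone master/worker on this machine: real executors.
    val standaloneConf = new org.apache.spark.SparkConf()
      .setMaster("spark://localhost:7077")
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "4g")
      .setAppName("local-standalone-test")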

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Ted Yu
Are you using YARN client mode? See https://spark.apache.org/docs/latest/running-on-yarn.html In cluster mode, spark.yarn.am.memory is not effective. For Spark 2.0, Akka is moved out of the picture. FYI On Sat, May 7, 2016 at 8:24 PM, Nirav Patel wrote: > I have 20
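On the thread's original question of verifying Kryo, a minimal sketch using the standard setting keys (the Environment tab of the web UI should show the same values):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new org.apache.spark.SparkContext(conf)
    // Confirm the setting actually reached the running context.
    println(sc.getConf.get("spark.serializer"))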

Re: Is it a bug?

2016-05-08 Thread Ted Yu
I don't think so. RDD is immutable. > On May 8, 2016, at 2:14 AM, Sisyphuss wrote:
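To illustrate the point about immutability (a toy example, not taken from the original message): transformations never modify an existing RDD, they return a new one.

    val nums    = sc.parallelize(1 to 5)
    val doubled = nums.map(_ * 2)        // returns a new RDD; nums itself is unchanged
    println(nums.collect().toList)       // List(1, 2, 3, 4, 5)
    println(doubled.collect().toList)    // List(2, 4, 6, 8, 10)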

Is it a bug?

2016-05-08 Thread Sisyphuss
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-a-bug-tp26898.html

Re: pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
Thanks Davies, after I did a coalesce(1) to save as a single parquet file, I was able to get head() to return the correct order. On Sun, May 8, 2016 at 12:29 AM, Davies Liu wrote: > When you have multiple parquet files, the order of all the rows in > them is not defined.
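A sketch of that workaround, shown in Scala although the thread is about pyspark (the DataFrame calls are analogous); note that coalesce(1) writes one file, which only makes sense for small outputs:

    // Sort, collapse to one partition so a single sorted file is written, then save.
    df.sort(df("value").desc)
      .coalesce(1)
      .write.parquet("/path/to/output")   // placeholder path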

Re: pyspark dataframe sort issue

2016-05-08 Thread Davies Liu
When you have multiple parquet files, the order of all the rows in them is not defined. On Sat, May 7, 2016 at 11:48 PM, Buntu Dev wrote: > I'm using pyspark dataframe api to sort by specific column and then saving > the dataframe as parquet file. But the resulting parquet

Re: sqlCtx.read.parquet yields lots of small tasks

2016-05-08 Thread Johnny W.
The file size is very small (< 1M). The stage launches every time I call: -- sqlContext.read.parquet(path_to_file) These are the parquet-specific configurations I set: -- spark.sql.parquet.filterPushdown: true spark.sql.parquet.mergeSchema: true Thanks, J. On Sat, May 7, 2016 at 4:20 PM, Ashish

pyspark dataframe sort issue

2016-05-08 Thread Buntu Dev
I'm using the pyspark dataframe API to sort by a specific column and then saving the dataframe as a parquet file. But the resulting parquet file doesn't seem to be sorted. Applying sort and doing a head() on the result shows the correct rows sorted by the 'value' column in descending order, as shown below:

Joining a RDD to a Dataframe

2016-05-08 Thread Cyril Scetbon
Hi, I have an RDD built during a Spark Streaming job and I'd like to join it to a DataFrame (E/S input) to enrich it. It seems that I can't join the RDD and the DF without first converting the RDD to a DF (tell me if I'm wrong). Here are the schemas of both DFs: scala> df res32: