Re: Spark streaming from Kafka best fit

2016-03-07 Thread pratik khadloya
Would using mapPartitions instead of map help here?

~Pratik

On Tue, Mar 1, 2016 at 10:07 AM Cody Koeninger wrote:
> You don't need an equal number of executor cores to partitions. An
> executor can and will work on multiple partitions within a batch, one
> after the other.
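For context, mapPartitions amortizes per-record setup across a whole partition. A minimal sketch of the pattern against a Kafka direct stream (broker address, topic name, batch interval, and the parser class are all invented for illustration):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Hypothetical parser with a non-trivial setup cost, for illustration.
    class ExpensiveParser { def parse(s: String): String = s.trim }

    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // assumed
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    // With mapPartitions the parser is built once per partition; a plain
    // map would pay that construction cost on every record.
    val parsed = stream.mapPartitions { records =>
      val parser = new ExpensiveParser()
      records.map { case (_, value) => parser.parse(value) }
    }

    parsed.print()
    ssc.start()
    ssc.awaitTermination()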

Spark job stuck with 0 input records

2015-11-14 Thread pratik khadloya
Hello, We are running Spark 1.4.1 on YARN.

  java.vendor=Oracle Corporation
  java.runtime.version=1.7.0_40-b43
  datanucleus-core-3.2.10.jar
  datanucleus-api-jdo-3.2.6.jar
  datanucleus-rdbms-3.2.9.jar

[Spark UI task table pasted here; columns: Index, ID, Attempt, Status, Locality Level, Executor ID / Host, Launch Time, Duration, GC Time, Input Size / …]

Re: Querying nested struct fields

2015-11-10 Thread pratik khadloya
…from agg_imps_df limit 10").collect()

> On Tue, Nov 10, 2015 at 11:24 AM, pratik khadloya <tispra...@gmail.com> wrote:
>> Hello,
>>
>> I just saved a PairRDD as a table, but I am not able to query it
>> correctly. The below and other variations doe…

Querying nested struct fields

2015-11-10 Thread pratik khadloya
Hello, I just saved a PairRDD as a table, but I am not able to query it correctly. The below and other variations do not seem to work.

  scala> hc.sql("select * from agg_imps_df").printSchema()
  |-- _1: struct (nullable = true)
  |    |-- item_id: long (nullable = true)
  |    |-- flight_id: long …
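For what it's worth, the nested `_1` struct is what Spark's reflection-based schema inference produces when a pair RDD is converted to a DataFrame: each tuple slot becomes a column, and a case-class key becomes a struct with its field names. A minimal sketch reproducing the schema above (the HiveContext setup and sample data are assumed):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)  // sc: an existing SparkContext
    import hc.implicits._

    // Hypothetical key type; its fields become the nested column names.
    case class Key(item_id: Long, flight_id: Long)

    val pairRdd = sc.parallelize(Seq((Key(1L, 10L), 2.5), (Key(2L, 20L), 3.0)))

    // Tuple slot 1 (the key) becomes column _1, a struct<item_id, flight_id>;
    // tuple slot 2 (the metric) becomes column _2.
    pairRdd.toDF().registerTempTable("agg_imps_df")
    hc.sql("select * from agg_imps_df").printSchema()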

Re: Querying nested struct fields

2015-11-10 Thread pratik khadloya
> …")
> sql("SELECT `_1`.`_1` FROM test")
>
> On Tue, Nov 10, 2015 at 11:31 AM, pratik khadloya <tispra...@gmail.com> wrote:
>> I tried the same, didn't work :(
>>
>> scala> hc.sql("select _1.item_id from agg_imps_df limit 10").collect…
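Assuming the underscore-prefixed identifiers were what the parser was rejecting (as the quoted reply suggests), backticks plus a dot path reach the nested fields; sketched against the hypothetical table registered above:

    // Backticks quote the tuple-derived column names; the dot descends
    // into the struct produced by the pair's key.
    hc.sql("SELECT `_1`.`item_id` AS item_id, `_1`.`flight_id` AS flight_id, `_2` AS metric1 " +
      "FROM agg_imps_df LIMIT 10").collect()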

PairRDD from SQL

2015-11-04 Thread pratik khadloya
Hello, Is it possible to have a pair RDD from the below SQL query? The pair being ((item_id, flight_id), metric1), where item_id and flight_id are part of the group by.

  SELECT item_id, flight_id, SUM(metric1) AS metric1
  FROM mytable
  GROUP BY item_id, flight_id

Thanks,
Pratik
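One way to get there, sketched under the assumption of a 1.x HiveContext in scope as `hc` and with column types guessed for illustration: run the query as a DataFrame, then map each Row into the pair shape.

    val df = hc.sql(
      """SELECT item_id, flight_id, SUM(metric1) AS metric1
        |FROM mytable
        |GROUP BY item_id, flight_id""".stripMargin)

    // Positions follow the SELECT order; Long/Double types are assumed.
    val pairRdd = df.rdd.map { row =>
      ((row.getLong(0), row.getLong(1)), row.getDouble(2))
    }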

Re: Huge shuffle data size

2015-10-23 Thread pratik khadloya
…aggImpsDf("day_hour") <=> aggRevenueDf("day_hour") &&
    aggImpsDf("day_hour_2") <=> aggRevenueDf("day_hour_2"),
  "inner")
.select(
  aggImpsDf("id_1"),
  aggImpsDf("id_2"),
  aggImpsDf("day_hour"),
  aggImpsDf(…
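A note on the fragment: `<=>` is Spark's null-safe equality, so two NULL keys compare as equal and every NULL-keyed row on one side joins every NULL-keyed row on the other. A quick skew check along those lines, assuming the two DataFrames from the fragment are in scope:

    import org.apache.spark.sql.functions.desc

    // Large per-key counts (including a big NULL bucket) multiply across
    // the join and can account for an exploding shuffle.
    aggImpsDf.groupBy("day_hour", "day_hour_2").count().orderBy(desc("count")).show(20)
    aggRevenueDf.groupBy("day_hour", "day_hour_2").count().orderBy(desc("count")).show(20)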

Re: Spark error:- Not a valid DFS File name

2015-10-23 Thread pratik khadloya
Check what you have at SimpleMktDataFlow.scala:106.

~Pratik

On Fri, Oct 23, 2015 at 11:47 AM kali.tumm...@gmail.com <kali.tumm...@gmail.com> wrote:
> Full Error:-
>   at org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:195)
>   at …

Re: Spark error:- Not a valid DFS File name

2015-10-23 Thread pratik khadloya
I had faced a similar issue; the problem was not actually in the file name. We run Spark on YARN, and the real error showed up in the application logs, pulled with:

  $ yarn logs -applicationId <application ID>

Scroll from the beginning to find the actual error.

~Pratik

On Fri, Oct 23, 2015 at 11:40 AM…

Re: Huge shuffle data size

2015-10-23 Thread pratik khadloya
…by.

~Pratik

On Fri, Oct 23, 2015 at 1:38 PM Kartik Mathur <kar...@bluedata.com> wrote:
> Don't use groupBy, use reduceByKey instead; groupBy should always be
> avoided as it leads to a lot of shuffle reads/writes.
>
> On Fri, Oct 23, 2015 at 11:39 AM, pratik khadloya <tispra…
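In RDD terms the difference looks like the sketch below (key and value types invented): reduceByKey combines values map-side before the shuffle, while groupByKey ships every value across the network first.

    val metrics = sc.parallelize(Seq(
      (("item1", "flight1"), 1.0),
      (("item1", "flight1"), 2.0),
      (("item2", "flight2"), 5.0)))

    // reduceByKey: partial sums happen on each map partition, so at most
    // one record per key per partition crosses the shuffle.
    val summed = metrics.reduceByKey(_ + _)

    // groupByKey: every individual value is shuffled, then summed on the
    // reducer -- same result, far more shuffle data.
    val summedViaGroup = metrics.groupByKey().mapValues(_.sum)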

Huge shuffle data size

2015-10-23 Thread pratik khadloya
Hello, Data about my Spark job is below. My source data is only 916 MB (stage 0) and 231 MB (stage 1), but when I join the two data sets (stage 2) it takes a very long time, and the shuffled data is 614 GB. Is this expected? Both data sets produce 200 partitions. Stage…
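If the join keys repeat on both sides, a key appearing m times on the left and n times on the right emits m × n rows, which can turn ~1 GB of input into hundreds of GB of shuffle output. One common mitigation (sketched with invented names, not necessarily the fix for this job) is to pre-aggregate each side to one row per key before joining:

    import org.apache.spark.sql.functions.sum

    // One row per key on each side caps the join at one output row per key.
    val impsPerKey = impsDf.groupBy("item_id", "flight_id")
      .agg(sum("imps").as("imps"))
    val revenuePerKey = revenueDf.groupBy("item_id", "flight_id")
      .agg(sum("revenue").as("revenue"))

    val joined = impsPerKey.join(revenuePerKey,
      impsPerKey("item_id") === revenuePerKey("item_id") &&
        impsPerKey("flight_id") === revenuePerKey("flight_id"))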

Re: Stream are not serializable

2015-10-23 Thread pratik khadloya
You might be referring to some class-level variables from your code. I got to see the actual field that caused the error once I marked the class as serializable and ran it on the cluster:

  class MyClass extends java.io.Serializable

The following resources will also help: …
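Marking the class Serializable surfaces the offending field, but the lighter-weight fix is usually to copy just the needed field into a local val so the closure stops capturing `this`. A sketch with a made-up class:

    import org.apache.spark.rdd.RDD

    class Scaler(val factor: Double) {  // hypothetical enclosing class
      def scale(rdd: RDD[Double]): RDD[Double] = {
        // Using `factor` directly in the closure would capture `this` and
        // require the whole Scaler instance to be serializable.
        val f = factor  // local copy: only this Double ships to executors
        rdd.map(_ * f)
      }
    }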