Getting NPE when trying to do spark streaming with Twitter

2016-04-10 Thread krisgari
I am new to Spark Streaming; when I tried to submit the Spark-Twitter streaming job, I got the following error: --- Lost task 0.0 in stage 0.0 (TID 0, sandbox.hortonworks.com): java.lang.NullPointerException at org.apache.spark.util.Utils$.decodeFileNameInURI(Utils.scala:340) at

High GC time when setting custom input partitions

2016-04-10 Thread Johnny W.
Hi spark-user, I am using Spark 1.6 to build a reverse index for one month of Twitter data (~50GB). The split size of HDFS is 1GB, thus by default sc.textFile creates 50 partitions. I'd like to increase the parallelism by increasing the number of input partitions. Thus, I use textFile(..., 200) to
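
For context, a minimal sketch of the pattern being described (path hypothetical). Note that textFile's second argument is a minimum partition count, a hint rather than an exact number:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("reverse-index"))

    // Default: roughly one partition per HDFS split (~50 for 50 GB at 1 GB splits).
    val byDefault = sc.textFile("hdfs:///data/twitter/2016-03")

    // Request at least 200 input partitions; Hadoop may split blocks further to honour it.
    val moreSplits = sc.textFile("hdfs:///data/twitter/2016-03", minPartitions = 200)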

Re: Graphframes pattern causing java heap space errors

2016-04-10 Thread Buntu Dev
Thanks Ted for the input. I was able to get it working with pyspark shell but the same job submitted via 'spark-submit' using client or cluster deploy mode ends up with these errors: ~ java.lang.OutOfMemoryError: Java heap space at java.lang.Object.clone(Native Method) at

Connection closed Exception.

2016-04-10 Thread Bijay Kumar Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames and join them. 2. Read the particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table
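
For readers following along, a minimal sketch of such a workflow on Spark 1.6 (all paths, column names, and table names hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    val hc = new HiveContext(sc)
    import hc.implicits._

    // 1. Read the two flat files as DataFrames and join them.
    val df1 = sc.textFile("/in/file1.txt").map(_.split("\t")).map(r => (r(0), r(1))).toDF("id", "a")
    val df2 = sc.textFile("/in/file2.txt").map(_.split("\t")).map(r => (r(0), r(1))).toDF("id", "b")
    val joined = df1.join(df2, "id")

    // 2. Join with a single partition of a Hive table.
    val part = hc.sql("SELECT id, c FROM src_table WHERE dt = '2016-04-10'")
    val enriched = joined.join(part, "id")

    // 3. Insert overwrite into the target Hive table.
    enriched.registerTempTable("result_tmp")
    hc.sql("INSERT OVERWRITE TABLE target_table SELECT * FROM result_tmp")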

Re: Datasets combineByKey

2016-04-10 Thread Koert Kuipers
Yes, it is. On Apr 10, 2016 3:17 PM, "Amit Sela" wrote: > I think org.apache.spark.sql.expressions.Aggregator is what I'm looking > for, makes sense? > > On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote: > >> I'm mapping the RDD API to the Datasets API and I

Fwd: Connection closed Exception.

2016-04-10 Thread Bijay Pathak
Hi, I am running Spark 1.6 on EMR. I have a workflow which does the following things: 1. Read the 2 flat files, create the data frames and join them. 2. Read the particular partition from the Hive table and join the dataframe from 1 with it. 3. Finally, insert overwrite into the Hive table

Re: alter table add columns alternatives or hive refresh

2016-04-10 Thread Maurin Lenglart
Your solution works in Hive, but not in Spark, even if I use a Hive context. I tried to create a temp table and then this query: - sqlContext.sql("insert into table myTable select * from myTable_temp") But I still get the same error. Thanks. From: Mich Talebzadeh
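
For reference, the recreate-and-reinsert approach under discussion, expressed via the DataFrame API (a sketch only; the column and table names are hypothetical, and it does rewrite the whole dataset, which is exactly the cost being debated in this thread):

    import org.apache.spark.sql.functions.lit

    // Rebuild the table with the new column set to NULL for existing rows.
    val df = sqlContext.table("myTable")
      .withColumn("new_col", lit(null).cast("string"))

    df.write.mode("overwrite").saveAsTable("myTable_new")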

Multiple folders to SqlContext

2016-04-10 Thread KhajaAsmath Mohammed
Hi, I am looking on how to add multiple folders to the Spark context and then make them into a dataframe. Let's say I have the below folders /daas/marts/US/file1.txt /daas/marts/CH/file2.txt /daas/marts/SG/file3.txt. The above files have the same schema. I don't want to create multiple dataframes, instead create only
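
Both a comma-separated path list and a glob should work here; a short sketch using the paths above:

    // Comma-separated list of files/folders into a single RDD:
    val rdd = sc.textFile("/daas/marts/US/file1.txt,/daas/marts/CH/file2.txt,/daas/marts/SG/file3.txt")

    // Or a glob, read directly into one DataFrame (Spark 1.6 text source, single 'value' column):
    val df = sqlContext.read.text("/daas/marts/*/file*.txt")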

Re: alter table add columns alternatives or hive refresh

2016-04-10 Thread Mich Talebzadeh
Hi, I am confining myself to Hive tables. As I stated before, I have not tried it in Spark. So I stand corrected. Let us try this simple test in Hive -- Create table hive> create table testme(col1 int); OK -- insert a row hive> insert into testme values(1); Loading data to table

Re: Datasets combineByKey

2016-04-10 Thread Amit Sela
I think org.apache.spark.sql.expressions.Aggregator is what I'm looking for, makes sense? On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote: > I'm mapping the RDD API to the Datasets API and I was wondering if I was missing > something or if this functionality is missing. > > >
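
For anyone searching the archives, a rough sketch of what an Aggregator-based per-key sum looks like against the (experimental) 1.6 Dataset API; treat the details as approximate:

    import org.apache.spark.sql.expressions.Aggregator
    import sqlContext.implicits._

    // zero/reduce/merge mirror createCombiner/mergeValue/mergeCombiners in combineByKey.
    val sumValues = new Aggregator[(String, Int), Int, Int] {
      def zero: Int = 0
      def reduce(acc: Int, in: (String, Int)): Int = acc + in._2
      def merge(a: Int, b: Int): Int = a + b
      def finish(acc: Int): Int = acc
    }.toColumn

    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()
    ds.groupBy(_._1).agg(sumValues).show()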

Re: alter table add columns alternatives or hive refresh

2016-04-10 Thread Maurin Lenglart
Hi, So basically you are telling me that I need to recreate a table, and re-insert everything every time I update a column? I understand the constraints, but that solution doesn’t look good to me. I am updating the schema every day and the table is a couple of TB of data. Do you see any other

Re: Sqoop on Spark

2016-04-10 Thread Jörn Franke
I am not 100% sure, but you could export to CSV in Oracle using external tables. Oracle also has the Hadoop Loader, which seems to support Avro. However, I think you need to buy the Big Data solution. > On 10 Apr 2016, at 16:12, Mich Talebzadeh wrote: > > Yes I

Infinite recursion in createDataFrame for avro types

2016-04-10 Thread Brad Cox
I'm getting a StackOverflowError from inside the createDataFrame call in this example. It originates in Scala code involving Java type inference which calls itself in an infinite loop. final EventParser parser = new EventParser(); JavaRDD eventRDD = sc.textFile(path) .map(new

How Application jar is copied to worker machines?

2016-04-10 Thread Hemalatha A
Hello, I want to know, on doing spark-submit, how the application jar is copied to worker machines. Who does the copying of the jars? Similarly, who copies the DAG from the driver to the executors? -- Regards Hemalatha

Re: Weird error while serialization

2016-04-10 Thread Ted Yu
Have you considered using PairRDDFunctions.aggregateByKey or PairRDDFunctions.reduceByKey in place of the groupBy to achieve better performance? Cheers On Sat, Apr 9, 2016 at 2:00 PM, SURAJ SHETH wrote: > Hi, > I am using Spark 1.5.2 > > The file contains 900K rows each
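
The suggestion in code form, as a sketch (assuming pairs: RDD[(String, Int)]): grouping ships every value across the network, while reduceByKey/aggregateByKey combine map-side before the shuffle:

    // Grouping then reducing materializes all values per key before summing:
    val slow = pairs.groupByKey().mapValues(_.sum)

    // reduceByKey combines partially on each partition before the shuffle:
    val fast = pairs.reduceByKey(_ + _)

    // aggregateByKey when the result type differs from the value type,
    // e.g. a (sum, count) pair per key for computing averages:
    val sumCount = pairs.aggregateByKey((0L, 0L))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (a, b) => (a._1 + b._1, a._2 + b._2))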

Re: Only 60% of Total Spark Batch Application execution time spent in Task Processing

2016-04-10 Thread Ted Yu
Jasmine: Let us know if listening to more events gives you a better picture. Thanks On Thu, Apr 7, 2016 at 1:54 PM, Jasmine George wrote: > Hi Ted, > > > > Thanks for replying so fast. > > > > We are using spark 1.5.2. > > I was collecting only TaskEnd events. > > I can

Re: Number of executors in spark-1.6 and spark-1.5

2016-04-10 Thread Vikash Pareek
Hi Talebzadeh, Thanks for your quick response. >>in 1.6, how many executors do you see for each node? I have 1 executor for 1 node with SPARK_WORKER_INSTANCES=1. >>in standalone mode how are you increasing the number of worker instances. Are you starting another slave on each node? No, I am not

Re: Sqoop on Spark

2016-04-10 Thread Mich Talebzadeh
Yes I meant MR. Again, one cannot beat the RDBMS export utility. I was specifically referring to Oracle in the above case, which does not provide any specific text-based export, only the binary ones (Exp, Data Pump, etc.). In the case of SAP ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy), which can be

Re: Sqoop on Spark

2016-04-10 Thread Michael Segel
Sqoop doesn’t use MapR… unless you meant to say M/R (MapReduce). The largest problem with Sqoop is that in order to gain parallelism you need to know how your underlying table is partitioned and to do multiple range queries. This may not be known, or your data may or may not be equally
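
The same constraint shows up if the extraction is done with Spark's own JDBC source: to parallelize the read you must supply a split column and its bounds up front. A sketch (connection details hypothetical):

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "scott")
    props.setProperty("password", "tiger")

    // Spark issues one range query per partition on the split column, which
    // assumes the column's value range is known and reasonably evenly spread.
    val df = sqlContext.read.jdbc(
      "jdbc:oracle:thin:@//dbhost:1521/ORCL", "EMPLOYEES",
      "EMPLOYEE_ID",   // column to split on
      1L, 1000000L,    // lower/upper bound of that column
      8,               // number of partitions => 8 range queries
      props)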

Re: Datasets combineByKey

2016-04-10 Thread Amit Sela
I'm mapping the RDD API to the Datasets API and I was wondering if I was missing something or if this functionality is missing. On Sun, Apr 10, 2016 at 3:00 PM Ted Yu wrote: > Haven't found any JIRA w.r.t. combineByKey for Dataset. > > What's your use case? > > Thanks > > On Sat,

RE: Unable run Spark in YARN mode

2016-04-10 Thread Yu, Yucai
Could you follow this guide http://spark.apache.org/docs/latest/running-on-yarn.html#configuration? Thanks, Yucai -Original Message- From: maheshmath [mailto:mahesh.m...@gmail.com] Sent: Saturday, April 9, 2016 1:58 PM To: user@spark.apache.org Subject: Unable run Spark in YARN mode I

LinkedIn streams in Spark

2016-04-10 Thread Deepak Sharma
Hello All, I am looking for a use case where anyone has used Spark Streaming integration with LinkedIn. -- Thanks Deepak

Re: Datasets combineByKey

2016-04-10 Thread Ted Yu
Haven't found any JIRA w.r.t. combineByKey for Dataset. What's your use case? Thanks On Sat, Apr 9, 2016 at 7:38 PM, Amit Sela wrote: > Is there (planned?) a combineByKey support for Dataset? > Is / Will there be support for combiner lifting? > > Thanks, > Amit >

Re: Graphframes pattern causing java heap space errors

2016-04-10 Thread Ted Yu
Looks like the exception occurred on the driver. Consider increasing the values for the following config: conf.set("spark.driver.memory", "10240m") conf.set("spark.driver.maxResultSize", "2g") Cheers On Sat, Apr 9, 2016 at 9:02 PM, Buntu Dev wrote: > I'm running it via
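
One caveat worth adding (an assumption about the poster's setup, not something stated in the thread): in client deploy mode the driver JVM is already running by the time SparkConf is read, so spark.driver.memory set in code has no effect there; pass --driver-memory on the spark-submit command line or set it in spark-defaults.conf instead. A sketch:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("graphframes-motifs")
      // Takes effect from code only when the driver JVM has not started yet
      // (e.g. cluster deploy mode); in client mode use --driver-memory 10g.
      .set("spark.driver.memory", "10240m")
      .set("spark.driver.maxResultSize", "2g")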

Re: alter table add columns alternatives or hive refresh

2016-04-10 Thread Mich Talebzadeh
I have not tried it on Spark, but a column added in Hive to an existing table cannot be updated for existing rows. In other words, the new column is set to null, which does not require a change in the existing file length. So basically, as I understand it, when a column is added to an already existing table.

Re: Number of executors in spark-1.6 and spark-1.5

2016-04-10 Thread Mich Talebzadeh
Hi, in 1.6, how many executors do you see for each node? In standalone mode, how are you increasing the number of worker instances? Are you starting another slave on each node? HTH Dr Mich Talebzadeh LinkedIn

Number of executors in spark-1.6 and spark-1.5

2016-04-10 Thread Vikash Pareek
Hi, I have upgraded a 5-node Spark cluster from spark-1.5 to spark-1.6 (to use the mapWithState function). After moving to spark-1.6, I am seeing strange behaviour from Spark: jobs are not using multiple executors on different nodes at a time, i.e. there is no parallel processing if each node has