error creating custom schema

2015-12-23 Thread Divya Gehlot
Hi, I am trying to create a custom schema but it's throwing the error below: scala> import org.apache.spark.sql.hive.HiveContext > import org.apache.spark.sql.hive.HiveContext > > scala> import org.apache.spark.sql.hive.orc._ > import org.apache.spark.sql.hive.orc._ > > scala> val hiveContext = new

Re: Problem with Spark Standalone

2015-12-23 Thread luca_guerra
I don't think it's a "malformed IP address" issue because I have used a URI and not an IP. Another piece of info: the master, driver and workers are hosted on the same machine, so I use "localhost" as the host for the Driver.

java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-23 Thread Vijay Gharge
A few indicators - 1) during execution time, check the total number of open files using the lsof command (needs root permissions). If it is a cluster, I am not so sure! 2) which exact line in the code is triggering this error? Can you paste that snippet? On Wednesday 23 December 2015, Priya Ch

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Eran Witkon
Did you get a solution for this? On Tue, 22 Dec 2015 at 20:24 raja kbv wrote: > Hi, > > I am new to spark. > > I have a text file with below structure. > > > (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, > Description, Duriation, Role}]}) >

Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-23 Thread Vyacheslav Yanuk
Hi. I have a very strange situation with direct reading from Kafka. For example, I have 1000 messages in Kafka. After submitting my application I read this data and process it. While I process the data, 10 new entries accumulate. In the next reading from Kafka I read only 3 records, not 10!!!

Re: rdd split into new rdd

2015-12-23 Thread Ted Yu
bq. {a=1, b=1, c=2, d=2} Can you elaborate your criteria a bit more ? The above seems to be a Set, not a Map. Cheers On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya wrote: > Hi, > > I have data > *JavaPairRDD> *format. In example: > > *(1610,

Unable to create hive table using HiveContext

2015-12-23 Thread Soni spark
Hi friends, I am trying to create a hive table through spark with Java code in Eclipse using the code below. HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc()); sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)"); but I am getting the error ERROR XBM0J:

Re: Using inteliJ for spark development

2015-12-23 Thread Eran Witkon
Thanks, so based on that article, should I use sbt or maven? Or either? Eran On Wed, 23 Dec 2015 at 13:05 Akhil Das wrote: > You will have to point to your spark-assembly.jar since spark has a lot of > dependencies. You can read the answers discussed over here to have

Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-23 Thread Priya Ch
ulimit -n 65000 fs.file-max = 65000 ( in etc/sysctl.conf file) Thanks, Padma Ch On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma wrote: > Could you share the ulimit for your setup please ? > > - Thanks, via mobile, excuse brevity. > On Dec 22, 2015 6:39 PM, "Priya Ch"

Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Vyacheslav Yanuk
Colleagues, the documentation for createDirectStream says: "This does not use Zookeeper to store offsets. The consumed offsets are tracked by the stream itself. For interoperability with Kafka monitoring tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself from the
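
For reference, a minimal Scala sketch of the direct-stream pattern the quoted documentation describes (the thread itself uses Python); the broker address, topic name and batch interval are placeholders, and sc is assumed to be an existing SparkContext:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("mytopic"))

    stream.foreachRDD { rdd =>
      // the direct stream tracks offsets itself (no Zookeeper); expose them here
      // if external monitoring tools need Kafka/Zookeeper to be updated
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsets.foreach(o => println(s"${o.topic}-${o.partition}: ${o.fromOffset} -> ${o.untilOffset}"))
    }
    // ssc.start(); ssc.awaitTermination() once the graph is defined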

Re: Problem with Spark Standalone

2015-12-23 Thread Vijay Gharge
Hi Luca, are you able to run these commands from the scala REPL shell? At least 1 round of iteration? Looking at the error which says the remote end is shutting down - I suspect some command in the code is triggering sc context shutdown or something similar. One more point - if both master and executor

Problem using limit clause in spark sql

2015-12-23 Thread tiandiwoxin1234
Hi, I am using spark sql in a way like this: sqlContext.sql(“select * from table limit 1”).map(...).collect() The problem is that the limit clause will collect all the 10,000 records into a single partition, resulting in the map afterwards running in only one partition and being really slow. I

rdd split into new rdd

2015-12-23 Thread Yasemin Kaya
Hi, I have data *JavaPairRDD> *format. In example: *(1610, {a=1, b=1, c=2, d=2}) * I want to get *JavaPairRDD* In example: *(1610, {a, b})* *(1610, {c, d})* Is there a way to solve this problem? Best, yasemin -- hiç ender hiç

Re: rdd split into new rdd

2015-12-23 Thread Stéphane Verlet
You should be able to do that using mapPartition On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu wrote: > bq. {a=1, b=1, c=2, d=2} > > Can you elaborate your criteria a bit more ? The above seems to be a Set, > not a Map. > > Cheers > > On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya

Re: Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Cody Koeninger
Read the documentation spark.apache.org/docs/latest/streaming-kafka-integration.html If you still have questions, read the resources linked from https://github.com/koeninger/kafka-exactly-once On Wed, Dec 23, 2015 at 7:24 AM, Vyacheslav Yanuk wrote: > Colleagues >

DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Christopher Brady
The documentation for DataFrameWriter.format(String) says: "Specifies the underlying output data source. Built-in options include "parquet", "json", etc." What options are there other than parquet and json? From googling I found "com.databricks.spark.avro", but that doesn't seem to work

How to call mapPartitions on DataFrame?

2015-12-23 Thread unk1102
Hi, I have the following code where I use mapPartitions on an RDD, but then I need to convert it back into a DataFrame. Why do I need to convert the DataFrame into an RDD and back into a DataFrame just to call mapPartitions? Why can't I call it directly on the DataFrame? sourceFrame.toJavaRDD().mapPartitions(new
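
A minimal Scala sketch of the round-trip pattern described above, assuming a Spark 1.5-era API, an existing sqlContext, and a partition function that returns Rows matching the original schema:

    import org.apache.spark.sql.Row

    val mapped = sourceFrame.rdd.mapPartitions { rows =>
      rows.map(r => r)   // per-partition transformation on Rows goes here
    }
    // re-attach the original schema to get a DataFrame back
    val backToDf = sqlContext.createDataFrame(mapped, sourceFrame.schema)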

Re: Do existing R packages work with SparkR data frames

2015-12-23 Thread Felix Cheung
Hi, SparkR has some support for machine learning algorithms like glm. For existing R packages, currently you would need to collect to convert into an R data.frame - assuming it fits into the memory of the driver node, though that would be required to work with an R package in any case.

Re: rdd split into new rdd

2015-12-23 Thread Yasemin Kaya
How can I use mapPartitions? Could you give me an example? 2015-12-23 17:26 GMT+02:00 Stéphane Verlet : > You should be able to do that using mapPartition > > On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu wrote: > >> bq. {a=1, b=1, c=2, d=2} >> >> Can you

Using Java Function API with Java 8

2015-12-23 Thread rdpratti
I am trying to pass lambda expressions to Spark JavaRDD methods. Having used lambda expressions in Java in general, I was hoping for similar behaviour and coding patterns, but am finding confusing compile errors. The use case is a lambda expression that has a number of statements, returning a

Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread Hokam Singh Chauhan
Hi, use spark://hostname:7077 as the spark master if you are currently using an IP address in place of the hostname. I have faced the same issue; it got resolved by using the hostname in the spark master URL instead of the IP address. Regards, Hokam On 23 Dec 2015 13:41, "Akhil Das" wrote: > You
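
A minimal sketch of what that looks like in driver code, assuming a standalone master listening on port 7077; "sparkmaster" and the app name are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("submit-from-eclipse")       // hypothetical app name
      .setMaster("spark://sparkmaster:7077")   // use the hostname the master advertises, not its raw IP
    val sc = new SparkContext(conf)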

Re: Problem using limit clause in spark sql

2015-12-23 Thread Zhan Zhang
There has to be a central point collaboratively collecting exactly the requested number of records; currently the approach is to use one single partition, which is easy to implement. Otherwise, the driver has to count the number of records in each partition and then decide how many records to be
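
One possible workaround, sketched assuming a Spark 1.5-era DataFrame API and an existing sqlContext (the table name, limit value and partition count are illustrative): repartition after the limit so the subsequent map runs in parallel.

    val limited = sqlContext.sql("select * from mytable limit 10000")
    val result = limited
      .repartition(8)        // arbitrary example value; pick something suited to the cluster
      .rdd
      .map(row => row)       // the expensive per-row work goes here
      .collect()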

Re: Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
I see. Thanks. > On Dec 24, 2015, at 11:44 AM, Zhan Zhang wrote: > > There has to have a central point to collaboratively collecting exactly 1 > records, currently the approach is using one single partitions, which is easy > to implement. > Otherwise, the driver has to

Re: Problem using limit clause in spark sql

2015-12-23 Thread Gaurav Agarwal
If I have the above scenario without using the limit clause, will it then check among all the partitions? On Dec 24, 2015 9:26 AM, "汪洋" wrote: > I see. > > Thanks. > > > On Dec 24, 2015, at 11:44 AM, Zhan Zhang wrote: > > There has to have a central

Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread ????????????
Hi Hokam, Thank you very much. Your approach really works after I set the hostname/IP in the Windows hosts file. However, new error information comes out. I think it's very common as I have seen such information in many places. Here's part of the information from the Eclipse console.

Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Yanbo Liang
If you want to use the CSV format, please refer to the spark-csv project and the examples. https://github.com/databricks/spark-csv 2015-12-24 4:40 GMT+08:00 Zhan Zhang : > Now json, parquet, orc(in hivecontext), text are natively supported. If > you use avro or others, you have
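
A sketch of how the two cases differ, assuming a DataFrame df and Spark 1.5-era APIs; the output paths and the spark-csv coordinates are examples:

    // built-in sources go by short name
    df.write.format("parquet").save("/tmp/out.parquet")
    df.write.format("json").save("/tmp/out.json")

    // external sources (spark-csv, spark-avro, ...) need their package on the classpath, e.g.
    //   spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/tmp/out.csv")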

Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
Both are similar, give both a go and choose the one you like. On Dec 23, 2015 7:55 PM, "Eran Witkon" wrote: > Thanks, so based on that article, should I use sbt or maven? Or either? > Eran > On Wed, 23 Dec 2015 at 13:05 Akhil Das wrote: > >> You

Re: Missing dependencies when submitting scala app

2015-12-23 Thread Daniel Valdivia
Hi Jeff, The problem was I was pulling json4s 3.3.0 which seems to have some problem with spark, I switched to 3.2.11 and everything is fine now On Tue, Dec 22, 2015 at 5:36 PM, Jeff Zhang wrote: > It might be jar conflict issue. Spark has dependency org.json4s.jackson, > do
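
For an sbt build, a one-line sketch of the fix described above, pinning json4s-jackson to the version the poster reports working with Spark 1.5:

    libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.2.11"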

Re: Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
It is an application running as an http server. So I collect the data as the response. > On Dec 24, 2015, at 8:22 AM, Hudong Wang wrote: > > When you call collect() it will bring all the data to the driver. Do you mean > to call persist() instead? > > From:

Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-23 Thread Antony Mayi
FYI, after further troubleshooting, logging this as https://issues.apache.org/jira/browse/SPARK-12511 On Tuesday, 22 December 2015, 18:16, Antony Mayi wrote: I narrowed it down to the problem described for example here:

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Gokula Krishnan D
You can try this, but I slightly modified the input structure since the first two columns were not in JSON format. [image: Inline image 1] Thanks & Regards, Gokula Krishnan* (Gokul)* On Wed, Dec 23, 2015 at 9:46 AM, Eran Witkon wrote: > Did you get a solution for this? > >

Re: Unable to create hive table using HiveContext

2015-12-23 Thread Zhan Zhang
You are using embedded mode, which will create the db locally (in your case, maybe the db has been created, but you do not have the right permission?). To connect to a remote metastore, hive-site.xml has to be correctly configured. Thanks. Zhan Zhang On Dec 23, 2015, at 7:24 AM, Soni spark

Re: rdd split into new rdd

2015-12-23 Thread Stéphane Verlet
I use Scala, but I guess in Java the code would look like this: JavaPairRDD> rdd ... JavaPairRDD rdd2 = rdd.mapPartitionsToPair(function , true) where function implements PairFlatMapFunction>,String, List>
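
In Scala, a minimal sketch of the split the thread describes, assuming the pairs are (String, Map[String, Int]) (the exact key/value types are an assumption, since the generics were lost above); the map's keys are grouped by their values, giving one output pair per distinct value:

    import org.apache.spark.rdd.RDD

    // rdd: e.g. ("1610", Map("a" -> 1, "b" -> 1, "c" -> 2, "d" -> 2))
    val rdd: RDD[(String, Map[String, Int])] = ???
    val split: RDD[(String, List[String])] = rdd.mapPartitions(iter =>
      iter.flatMap { case (id, m) =>
        m.groupBy { case (_, v) => v }            // Map[value, Map[key, value]]
          .values
          .map(group => (id, group.keys.toList))  // ("1610", List("a", "b")), ("1610", List("c", "d"))
      },
      preservesPartitioning = true)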

Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Zhan Zhang
Now json, parquet, orc(in hivecontext), text are natively supported. If you use avro or others, you have to include the package, which are not built into spark jar. Thanks. Zhan Zhang On Dec 23, 2015, at 8:57 AM, Christopher Brady

RE: How to Parse & flatten JSON object in a text file using Spark into Dataframe

2015-12-23 Thread Bharathi Raja
Thanks Gokul, but the file I have has the same format as I mentioned. The first two columns are not in JSON format. Thanks, Raja -Original Message- From: "Gokula Krishnan D" Sent: 12/24/2015 2:44 AM To: "Eran Witkon" Cc: "raja kbv"

RE: How to Parse & flatten JSON object in a text file using Spark into Dataframe

2015-12-23 Thread Bharathi Raja
Hi Eran, I didn't get the solution yet. Thanks, Raja -Original Message- From: "Eran Witkon" Sent: 12/23/2015 8:17 PM To: "raja kbv" ; "user@spark.apache.org" Subject: Re: How to Parse & flatten JSON object in a text

Re: error creating custom schema

2015-12-23 Thread Ted Yu
Looks like a comma was missing after "C1" Cheers > On Dec 23, 2015, at 1:47 AM, Divya Gehlot wrote: > > Hi, > I am trying to create custom schema but its throwing below error > > >> scala> import org.apache.spark.sql.hive.HiveContext >> import
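
A minimal sketch of a hand-built schema with the comma in place, assuming Spark 1.5-era APIs; the types and field names other than "C1" are placeholders:

    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    val customSchema = StructType(Seq(
      StructField("C1", StringType, nullable = true),   // the comma between fields is required
      StructField("C2", IntegerType, nullable = true)
    ))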

Re: Using inteliJ for spark development

2015-12-23 Thread Eran Witkon
Thanks, all of these examples show how to link to the spark source and build it as part of my project. Why should I do that? Why not point directly to my spark.jar? Am I missing something? Eran On Wed, Dec 23, 2015 at 9:59 AM Akhil Das wrote: > 1. Install sbt plugin on

Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
You will have to point to your spark-assembly.jar since spark has a lot of dependencies. You can read the answers discussed over here to have a better understanding http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits Thanks Best Regards On Wed, Dec 23, 2015 at 4:27 PM,
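
If you go the sbt route, a minimal build.sbt sketch, assuming a Spark 1.5.2 / Scala 2.10 project (the project name is hypothetical); declaring Spark as a "provided" dependency pulls in its transitive dependencies for the IDE without bundling them into your application jar:

    name := "my-spark-app"
    scalaVersion := "2.10.5"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.2" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.5.2" % "provided"
    )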

Re: Problem with Spark Standalone

2015-12-23 Thread luca_guerra
This is the master's log file: 15/12/22 03:23:05 ERROR FileAppender: Error writing stream to file /disco1/spark-1.5.1/work/app-20151222032252-0010/0/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162) at

Re: Streaming json records from kafka ... how can I process ... help please :)

2015-12-23 Thread Akhil
Akhil wrote > You can do it like this: > > lines.foreachRDD(jsonRDD =>{ > > val data = sqlContext.read.json(jsonRDD) > data.registerTempTable("mytable") > sqlContext.sql("SELECT * FROM mytable") > > }) See
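
A self-contained Scala sketch of the quoted pattern, assuming a Spark 1.5-era API and that lines is a DStream[String] of JSON records; SQLContext.getOrCreate avoids capturing a driver-side context inside foreachRDD:

    import org.apache.spark.sql.SQLContext

    lines.foreachRDD { jsonRDD =>
      val sqlContext = SQLContext.getOrCreate(jsonRDD.sparkContext)
      val data = sqlContext.read.json(jsonRDD)   // infer a schema from the JSON records
      data.registerTempTable("mytable")
      sqlContext.sql("SELECT * FROM mytable").show()
    }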

Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
1. Install sbt plugin on IntelliJ 2. Create a new project/Import an sbt project like Dean suggested 3. Happy Debugging. You can also refer to this article for more information https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ Thanks Best

Re: Streaming json records from kafka ... how can I process ... help please :)

2015-12-23 Thread Gideon
What you wrote is inaccurate. When you create a directkafkastream, what happens is that you actually create a DirectKafkaInputDStream. This DirectKafkaInputDStream extends DStream. Two functions that a DStream has are map and print. When you map on your DirectKafkaInputDStream, what you're actually

Re: I coded an example to use Twitter stream as a data source for Spark

2015-12-23 Thread Amir Rahnama
That's my goal, brother. But let's agree Spark is not a very straightforward repo to get yourself started with. I have got some initial code though. On Wednesday, 23 December 2015, Akhil Das wrote: > Why not create a custom dstream >

Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread Akhil Das
You need to: 1. Make sure your local router has NAT enabled and has port-forwarded the networking ports listed here. 2. Make sure port 7077 on your cluster is accessible from your local (public) IP address. You can try telnet