error creating custom schema

2015-12-23 Thread Divya Gehlot
Hi,
I am trying to create a custom schema, but it's throwing the error below:


scala> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.hive.HiveContext
>
> scala> import org.apache.spark.sql.hive.orc._
> import org.apache.spark.sql.hive.orc._
>
> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
> 15/12/23 04:42:09 WARN SparkConf: The configuration key
> 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark
> 1.3 and and may be removed in the future. Please use the new key
> 'spark.yarn.am.waitTime' instead.
> 15/12/23 04:42:09 INFO HiveContext: Initializing execution hive, version
> 0.13.1
> hiveContext: org.apache.spark.sql.hive.HiveContext =
> org.apache.spark.sql.hive.HiveContext@3ca50ddf
>
> scala> import org.apache.spark.sql.types.{StructType, StructField,
> StringType, IntegerType,FloatType ,LongType ,TimestampType };
> import org.apache.spark.sql.types.{StructType, StructField, StringType,
> IntegerType, FloatType, LongType, TimestampType}
>
> scala> val loandepoSchema = StructType(Seq(
>  | StructField("C1" StringType, true),
>  | StructField("COLUMN2", StringType , true),
>  | StructField("COLUMN3", StringType, true),
>  | StructField("COLUMN4", StringType, true),
>  | StructField("COLUMN5", StringType , true),
>  | StructField("COLUMN6", StringType, true),
>  | StructField("COLUMN7", StringType, true),
>  | StructField("COLUMN8", StringType, true),
>  | StructField("COLUMN9", StringType, true),
>  | StructField("COLUMN10", StringType, true),
>  | StructField("COLUMN11", StringType, true),
>  | StructField("COLUMN12", StringType, true),
>  | StructField("COLUMN13", StringType, true),
>  | StructField("COLUMN14", StringType, true),
>  | StructField("COLUMN15", StringType, true),
>  | StructField("COLUMN16", StringType, true),
>  | StructField("COLUMN17", StringType, true)
>  | StructField("COLUMN18", StringType, true),
>  | StructField("COLUMN19", StringType, true),
>  | StructField("COLUMN20", StringType, true),
>  | StructField("COLUMN21", StringType, true),
>  | StructField("COLUMN22", StringType, true)))
> :25: error: value StringType is not a member of String
>StructField("C1" StringType, true),
> ^
>

Would really appreciate the guidance/pointers.

Thanks,
Divya


Re: Problem with Spark Standalone

2015-12-23 Thread luca_guerra
I don't think it's a "malformed IP address" issue because I have used an uri
and not an IP.
Another info, the master, driver and workers are hosted on the same machine
so I use "localhost" as host for the Driver.






java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-23 Thread Vijay Gharge
Few indicators -

1) During execution, check the total number of open files using the lsof
command (needs root permissions). If it is a cluster, I am not so sure.
2) Which exact line in the code is triggering this error? Can you paste
that snippet?

On Wednesday 23 December 2015, Priya Ch > wrote:

> ulimit -n 65000
>
> fs.file-max = 65000 ( in etc/sysctl.conf file)
>
> Thanks,
> Padma Ch
>
> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>
>> Could you share the ulimit for your setup please ?
>>
>> - Thanks, via mobile,  excuse brevity.
>> On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:
>>
>>> Jakob,
>>>
>>>Increased the settings like fs.file-max in /etc/sysctl.conf and also
>>> increased user limit in /etc/security/limits.conf. But still see the
>>> same issue.
>>>
>>> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
>>> wrote:
>>>
 It might be a good idea to see how many files are open and try
 increasing the open file limit (this is done on an os level). In some
 application use-cases it is actually a legitimate need.

 If that doesn't help, make sure you close any unused files and streams
 in your code. It will also be easier to help diagnose the issue if you send
 an error-reproducing snippet.

>>>
>>>
>

-- 
Regards,
Vijay Gharge
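
A minimal Scala sketch of Jakob's suggestion above about closing files and streams: open per-partition resources once inside mapPartitions and close them before returning. The side-file path and the lookup logic are hypothetical placeholders, and rdd is assumed to be an RDD[String].

import scala.io.Source

val enriched = rdd.mapPartitions { records =>
  // One file handle per partition instead of one per record, and it is always closed,
  // so tasks do not leak file descriptors.
  val source = Source.fromFile("/tmp/lookup.txt")   // placeholder side file
  val lookup = try source.getLines().toSet finally source.close()
  records.map(r => (r, lookup.contains(r)))
}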


Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Eran Witkon
Did you get a solution for this?
On Tue, 22 Dec 2015 at 20:24 raja kbv  wrote:

> Hi,
>
> I am new to spark.
>
> I have a text file with below structure.
>
>
> (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName,
> Description, Duriation, Role}]})
> Eg:
> (123456, Employee1, {“ProjectDetails”:[
>  { “ProjectName”:
> “Web Develoement”, “Description” : “Online Sales website”, “Duration” : “6
> Months” , “Role” : “Developer”}
>  { “ProjectName”:
> “Spark Develoement”, “Description” : “Online Sales Analysis”, “Duration” :
> “6 Months” , “Role” : “Data Engineer”}
>  { “ProjectName”:
> “Scala Training”, “Description” : “Training”, “Duration” : “1 Month” }
>   ]
> }
>
>
> Could someone help me to parse & flatten the record as below dataframe
> using scala?
>
> employeeID,Name, ProjectName, Description, Duration, Role
> 123456, Employee1, Web Develoement, Online Sales website, 6 Months ,
> Developer
> 123456, Employee1, Spark Develoement, Online Sales Analysis, 6 Months,
> Data Engineer
> 123456, Employee1, Scala Training, Training, 1 Month, null
>
>
> Thank you in advance.
>
> Regards,
> Raja
>


Spark Streaming 1.5.2+Kafka+Python. Strange reading

2015-12-23 Thread Vyacheslav Yanuk
Hi.
I have a very strange situation with direct reading from Kafka.
For example:
I have 1000 messages in Kafka.
After submitting my application, I read this data and process it.
While I am processing the data, 10 new entries accumulate.
On the next read from Kafka I get only 3 records, not 10!!!
Why???
I don't understand...
Explain to me please!

-- 
WBR, Vyacheslav Yanuk
Codeminders.com


Re: rdd split into new rdd

2015-12-23 Thread Ted Yu
bq. {a=1, b=1, c=2, d=2}

Can you elaborate your criteria a bit more ? The above seems to be a Set,
not a Map.

Cheers

On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya  wrote:

> Hi,
>
> I have data
> *JavaPairRDD> *format. In example:
>
> *(1610, {a=1, b=1, c=2, d=2}) *
>
> I want to get
> *JavaPairRDD* In example:
>
>
> *(1610, {a, b})*
> *(1610, {c, d})*
>
> Is there a way to solve this problem?
>
> Best,
> yasemin
> --
> hiç ender hiç
>


Unable to create hive table using HiveContext

2015-12-23 Thread Soni spark
Hi friends,

I am trying to create hive table through spark with Java code in Eclipse
using below code.

HiveContext sqlContext = new org.apache.spark.sql.hive.HiveContext(sc.sc());
   sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");


but I am getting this error:

ERROR XBM0J: Directory /home/workspace4/Test/metastore_db already exists.

I am not sure why the metastore is being created in the workspace. Please help me.

Thanks
Soniya


Re: Using inteliJ for spark development

2015-12-23 Thread Eran Witkon
Thanks, so based on that article, should I use sbt or maven? Or either?
Eran
On Wed, 23 Dec 2015 at 13:05 Akhil Das  wrote:

> You will have to point to your spark-assembly.jar since spark has a lot of
> dependencies. You can read the answers discussed over here to have a better
> understanding
> http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits
>
> Thanks
> Best Regards
>
> On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon  wrote:
>
>> Thanks, all of these examples shows how to link to spark source and build
>> it as part of my project. why should I do that? why not point directly to
>> my spark.jar?
>> Am I missing something?
>> Eran
>>
>> On Wed, Dec 23, 2015 at 9:59 AM Akhil Das 
>> wrote:
>>
>>> 1. Install sbt plugin on IntelliJ
>>> 2. Create a new project/Import an sbt project like Dean suggested
>>> 3. Happy Debugging.
>>>
>>> You can also refer to this article for more information
>>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon 
>>> wrote:
>>>
 Any pointers how to use InteliJ for spark development?
 Any way to use scala worksheet run like spark- shell?

>>>
>>>
>
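
For reference, the usual alternative to linking the Spark sources or a local spark-assembly.jar is to declare Spark as a dependency and let sbt (or Maven) resolve it transitively; a minimal build.sbt sketch, with the Spark and Scala versions assumed:

// build.sbt -- minimal sketch; adjust the versions to match your cluster
name := "my-spark-app"

version := "0.1.0"

scalaVersion := "2.10.4"

// "provided" keeps spark-core out of your assembled jar, since the cluster already ships it
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.2" % "provided"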


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2015-12-23 Thread Priya Ch
ulimit -n 65000

fs.file-max = 65000 ( in etc/sysctl.conf file)

Thanks,
Padma Ch

On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:

> Could you share the ulimit for your setup please ?
>
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:
>
>> Jakob,
>>
>>Increased the settings like fs.file-max in /etc/sysctl.conf and also
>> increased user limit in /etc/security/limits.conf. But still see the
>> same issue.
>>
>> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
>> wrote:
>>
>>> It might be a good idea to see how many files are open and try
>>> increasing the open file limit (this is done on an os level). In some
>>> application use-cases it is actually a legitimate need.
>>>
>>> If that doesn't help, make sure you close any unused files and streams
>>> in your code. It will also be easier to help diagnose the issue if you send
>>> an error-reproducing snippet.
>>>
>>
>>


Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Vyacheslav Yanuk
Colleagues,
The documentation about createDirectStream says that

"This does not use Zookeeper to store offsets. The consumed offsets are
tracked by the stream itself. For interoperability with Kafka monitoring
tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself
from the streaming application. You can access the offsets used in each
batch from the generated RDDs (see   "

My question is:
How can I access the offsets used in each batch?
What should I see?

-- 
WBR, Vyacheslav Yanuk
Codeminders.com


Re: Problem with Spark Standalone

2015-12-23 Thread Vijay Gharge
Hi Luca

Are you able to run these commands from the Scala REPL shell? At least one round
of iteration?

Looking at the error, which says the remote end is shutting down, I suspect some
command in the code is triggering a SparkContext shutdown or something similar.

One more point: if the master and executor are both on the same server, why
don't you paste the complete code without any modification, as well as a Spark GUI
screenshot from the same time?

As asked in the other mail, did this setup ever work in the past?

On Wednesday 23 December 2015, luca_guerra  wrote:

> I don't think it's a "malformed IP address" issue because I have used an
> uri
> and not an IP.
> Another info, the master, driver and workers are hosted on the same machine
> so I use "localhost" as host for the Driver.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Problem-with-Spark-Standalone-tp25750p25787.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>

-- 
Regards,
Vijay Gharge


Problem using limit clause in spark sql

2015-12-23 Thread tiandiwoxin1234
Hi,
I am using spark sql in a way like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause will collect all 10,000 records
into a single partition, so the map afterwards runs in only one
partition and is really slow. I tried to use repartition, but it is kind
of a waste to collect all those records into one partition, then shuffle
them around, and then collect them again.

Is there a way to work around this?
BTW, there is no order by clause, and I do not care which 10,000 records I get
as long as the total number is less than or equal to 10,000.






rdd split into new rdd

2015-12-23 Thread Yasemin Kaya
Hi,

I have data
*JavaPairRDD> *format. In example:

*(1610, {a=1, b=1, c=2, d=2}) *

I want to get
*JavaPairRDD* In example:


*(1610, {a, b})*
*(1610, {c, d})*

Is there a way to solve this problem?

Best,
yasemin
-- 
hiç ender hiç


Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
Hi,
I am using spark sql in a way like this:

sqlContext.sql("select * from table limit 10000").map(...).collect()

The problem is that the limit clause will collect all 10,000 records into a
single partition, so the map afterwards runs in only one partition
and is really slow. I tried to use repartition, but it is kind of a waste to
collect all those records into one partition, then shuffle them around, and
then collect them again.

Is there a way to work around this?
BTW, there is no order by clause, and I do not care which 10,000 records I get as
long as the total number is less than or equal to 10,000.

Re: rdd split into new rdd

2015-12-23 Thread Stéphane Verlet
You should be able to do that using mapPartition

On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu  wrote:

> bq. {a=1, b=1, c=2, d=2}
>
> Can you elaborate your criteria a bit more ? The above seems to be a Set,
> not a Map.
>
> Cheers
>
> On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya  wrote:
>
>> Hi,
>>
>> I have data
>> *JavaPairRDD> *format. In example:
>>
>> *(1610, {a=1, b=1, c=2, d=2}) *
>>
>> I want to get
>> *JavaPairRDD* In example:
>>
>>
>> *(1610, {a, b})*
>> *(1610, {c, d})*
>>
>> Is there a way to solve this problem?
>>
>> Best,
>> yasemin
>> --
>> hiç ender hiç
>>
>
>


Re: Spark Streaming 1.5.2+Kafka+Python (docs)

2015-12-23 Thread Cody Koeninger
Read the documentation
spark.apache.org/docs/latest/streaming-kafka-integration.html
If you still have questions, read the resources linked from
https://github.com/koeninger/kafka-exactly-once

On Wed, Dec 23, 2015 at 7:24 AM, Vyacheslav Yanuk 
wrote:

> Colleagues
> Documents written about  createDirectStream that
>
> "This does not use Zookeeper to store offsets. The consumed offsets are
> tracked by the stream itself. For interoperability with Kafka monitoring
> tools that depend on Zookeeper, you have to update Kafka/Zookeeper yourself
> from the streaming application. You can access the offsets used in each
> batch from the generated RDDs (see   "
>
> My question is.
> How I can access the offsets used in each batch ???
> What I should SEE???
>
> --
> WBR, Vyacheslav Yanuk
> Codeminders.com
>
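
For reference, the pattern in that integration guide is to cast each batch's RDD to HasOffsetRanges; a Scala sketch, where the broker, topic and StreamingContext (ssc) are placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Placeholders: substitute your own brokers, topic and StreamingContext.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))

var offsetRanges = Array.empty[OffsetRange]

stream.transform { rdd =>
  // The cast has to happen on the RDD produced directly by the direct stream.
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  // Each OffsetRange carries topic, partition, fromOffset and untilOffset for the batch.
  offsetRanges.foreach(o => println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}"))
}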


DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Christopher Brady

The documentation for DataFrameWriter.format(String) says:
"Specifies the underlying output data source. Built-in options include 
"parquet", "json", etc."


What options are there other than parquet and json? From googling I 
found "com.databricks.spark.avro", but that doesn't seem to work 
correctly for me. (It always hangs at the end.) Can I use CSV? Can I use 
sequence files?


How to call mapPartitions on DataFrame?

2015-12-23 Thread unk1102
Hi, I have the following code where I use mapPartitions on an RDD, but then I need
to convert it back into a DataFrame. Why do I need to convert the DataFrame into an RDD
and back into a DataFrame just for calling mapPartitions? Why can't I call it
directly on the DataFrame?

sourceFrame.toJavaRDD().mapPartitions(new
FlatMapFunction<Iterator<Row>, Row>() {

   @Override
   public Iterable<Row> call(Iterator<Row> rowIterator) throws Exception {
List<Row> rowAsList = new ArrayList<>();
while (rowIterator.hasNext()) {
  Row row = rowIterator.next();
  rowAsList = iterate(JavaConversions.seqAsJavaList(row.toSeq()));
  Row updatedRow = RowFactory.create(rowAsList.toArray());
  rowAsList.add(updatedRow);
}
return rowAsList;
   }


When I look at the method signature, it
is mapPartitions(scala.Function1<Iterator<Row>, Iterator<R>> f, ClassTag<R>
evidence$5).

How do I map the above code onto dataframe.mapPartitions? Please guide me, I am new
to Spark.
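
For what it's worth, in Spark 1.5 the DataFrame API does expose mapPartitions directly (in Scala it takes an Iterator[Row] => Iterator[R] and returns an RDD[R], which is why a DataFrame still has to be rebuilt afterwards). A rough Scala sketch, assuming a sqlContext is in scope and with the per-row transformation left as a placeholder:

import org.apache.spark.sql.Row

// transformRow is a hypothetical stand-in for the iterate(...) logic above.
def transformRow(row: Row): Row = row

// mapPartitions on the DataFrame itself yields an RDD[Row] ...
val transformed = sourceFrame.mapPartitions { rows => rows.map(transformRow) }

// ... so a DataFrame is rebuilt from it, reusing the original schema.
val result = sqlContext.createDataFrame(transformed, sourceFrame.schema)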






Re: Do existing R packages work with SparkR data frames

2015-12-23 Thread Felix Cheung
Hi
SparkR has some support for machine learning algorithms like glm.
For existing R packages, you would currently need to collect the SparkR DataFrame to convert it into an R
data.frame, assuming it fits into the memory of the driver node, though that
would be required to work with an R package in any case.



_
From: Lan 
Sent: Tuesday, December 22, 2015 4:50 PM
Subject: Do existing R packages work with SparkR data frames
To:  


   Hello,   

 Is it possible for existing R Machine Learning packages (which work with R   
 data frames) such as bnlearn, to work with SparkR data frames? Or do I need   
 to convert SparkR data frames to R data frames? Is "collect" the function to   
 do the conversion, or how else to do that?   

 Many Thanks,   
 Lan   



 --   
 View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Do-existing-R-packages-work-with-SparkR-data-frames-tp25772.html
   
 Sent from the Apache Spark User List mailing list archive at Nabble.com.   

 -   
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org   
 For additional commands, e-mail: user-h...@spark.apache.org   

   


  

Re: rdd split into new rdd

2015-12-23 Thread Yasemin Kaya
How can I use mapPartitions? Could you give me an example?

2015-12-23 17:26 GMT+02:00 Stéphane Verlet :

> You should be able to do that using mapPartition
>
> On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu  wrote:
>
>> bq. {a=1, b=1, c=2, d=2}
>>
>> Can you elaborate your criteria a bit more ? The above seems to be a Set,
>> not a Map.
>>
>> Cheers
>>
>> On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya  wrote:
>>
>>> Hi,
>>>
>>> I have data
>>> *JavaPairRDD> *format. In example:
>>>
>>> *(1610, {a=1, b=1, c=2, d=2}) *
>>>
>>> I want to get
>>> *JavaPairRDD* In example:
>>>
>>>
>>> *(1610, {a, b})*
>>> *(1610, {c, d})*
>>>
>>> Is there a way to solve this problem?
>>>
>>> Best,
>>> yasemin
>>> --
>>> hiç ender hiç
>>>
>>
>>
>


-- 
hiç ender hiç


Using Java Function API with Java 8

2015-12-23 Thread rdpratti
I am trying to pass lambda expressions to Spark JavaRDD methods.

Having used lambda expressions in Java in general, I was hoping for
similar behaviour and coding patterns, but am finding confusing compile
errors.

The use case is a lambda expression that has a number of statements,
returning a boolean from various points in the logic. 

I have tried both inline, as well as defining a Function functional type
with no luck.

Here is an example:

Function<String, Boolean> checkHeaders2 = x -> {if
(x.startsWith("npi")||x.startsWith("CPT"))

return new Boolean(false);

else new Boolean(true); };

This code gets an error stating that method must return a Boolean.

I know that the lambda expression can be shortened and included as a simple
one statement return, but using non-Spark Java 8 and a Predicate functional
type this would compile and be usable.  

What am I missing, and how do I use the Spark Function type to define lambda
expressions made up of multiple Java statements?

Thanks

rd






Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread Hokam Singh Chauhan
Hi,

Use spark://hostname:7077 as the Spark master URL if you are currently using an IP address in
place of the hostname.

I have faced the same issue; it got resolved by using the hostname in the Spark
master URL instead of the IP address.

Regards,
Hokam
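
A small Scala sketch of how these settings end up in the driver code (the master hostname and workstation IP below are taken from this thread and are placeholders for your own values; spark.driver.host is the setting Akhil mentions in the quoted reply below):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://hadoop00:7077")       // master by hostname rather than IP
  .set("spark.driver.host", "10.20.6.23")   // the machine running the IDE, so executors can call back
val sc = new SparkContext(conf)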
On 23 Dec 2015 13:41, "Akhil Das"  wrote:

> You need to:
>
> 1. Make sure your local router have NAT enabled and port forwarded the
> networking ports listed here
> .
> 2. Make sure on your clusters 7077 is accessible from your local (public)
> ip address. You can try telnet 10.20.17.70 7077
> 3. Set spark.driver.host so that the cluster can connect back to your
> machine.
>
>
>
> Thanks
> Best Regards
>
> On Wed, Dec 23, 2015 at 10:02 AM, superbee84  wrote:
>
>> Hi All,
>>
>>I'm new to Spark. Before I describe the problem, I'd like to let you
>> know
>> the role of the machines that organize the cluster and the purpose of my
>> work. By reading and follwing the instructions and tutorials, I
>> successfully
>> built up a cluster with 7 CentOS-6.5 machines. I installed Hadoop 2.7.1,
>> Spark 1.5.1, Scala 2.10.4 and ZooKeeper 3.4.5 on them. The details are
>> listed as below:
>>
>>
>> Host Name  |  IP Address   |  Hadoop 2.7.1            | Spark 1.5.1     | ZooKeeper
>> hadoop00   |  10.20.17.70  |  NameNode(Active)        | Master(Active)  | none
>> hadoop01   |  10.20.17.71  |  NameNode(Standby)       | Master(Standby) | none
>> hadoop02   |  10.20.17.72  |  ResourceManager(Active) | none            | none
>> hadoop03   |  10.20.17.73  |  ResourceManager(Standby)| none            | none
>> hadoop04   |  10.20.17.74  |  DataNode                | Worker          | JournalNode
>> hadoop05   |  10.20.17.75  |  DataNode                | Worker          | JournalNode
>> hadoop06   |  10.20.17.76  |  DataNode                | Worker          | JournalNode
>>
>>Now my *purpose* is to develop Hadoop/Spark applications on my own
>> computer(IP: 10.20.6.23) and submit them to the remote cluster. As all the
>> other guys in our group are in the habit of eclipse on Windows, I'm trying
>> to work on this. I have successfully submitted the WordCount MapReduce job
>> to YARN and it run smoothly through eclipse and Windows. But when I tried
>> to
>> run the Spark WordCount, it gives me the following error in the eclipse
>> console:
>>
>> 15/12/23 11:15:30 INFO AppClient$ClientEndpoint: Connecting to master
>> spark://10.20.17.70:7077...
>> 15/12/23 11:15:50 ERROR SparkUncaughtExceptionHandler: Uncaught exception
>> in
>> thread Thread[appclient-registration-retry-thread,5,main]
>> java.util.concurrent.RejectedExecutionException: Task
>> java.util.concurrent.FutureTask@29ed85e7 rejected from
>> java.util.concurrent.ThreadPoolExecutor@28f21632[Running, pool size = 1,
>> active threads = 0, queued tasks = 0, completed tasks = 1]
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown
>> Source)
>> at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
>> at java.util.concurrent.AbstractExecutorService.submit(Unknown
>> Source)
>> at
>>
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
>> at
>>
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>> at
>>
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
>> at
>>
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
>> at
>>
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
>> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
>> at
>>
>> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
>> Source)
>> at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
>> at
>>
>> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown
>> Source)
>> at
>>
>> 

Re: Problem using limit clause in spark sql

2015-12-23 Thread Zhan Zhang
There has to be a central point for collaboratively collecting exactly 10,000
records; currently the approach is to use one single partition, which is easy
to implement.
Otherwise, the driver has to count the number of records in each partition and
then decide how many records to materialize in each partition, because
some partitions may not have enough records; sometimes a partition is even
empty.

I didn't see any straightforward workaround for this.

Thanks.

Zhan Zhang



On Dec 23, 2015, at 5:32 PM, 汪洋 
> wrote:

It is an application running as an http server. So I collect the data as the 
response.

在 2015年12月24日,上午8:22,Hudong Wang 
> 写道:

When you call collect() it will bring all the data to the driver. Do you mean 
to call persist() instead?


From: tiandiwo...@icloud.com
Subject: Problem using limit clause in spark sql
Date: Wed, 23 Dec 2015 21:26:51 +0800
To: user@spark.apache.org

Hi,
I am using spark sql in a way like this:

sqlContext.sql(“select * from table limit 1”).map(...).collect()

The problem is that the limit clause will collect all the 10,000 records into a 
single partition, resulting the map afterwards running only in one partition 
and being really slow.I tried to use repartition, but it is kind of a waste to 
collect all those records into one partition and then shuffle them around and 
then collect them again.

Is there a way to work around this?
BTW, there is no order by clause and I do not care which 1 records I get as 
long as the total number is less or equal then 1.




Re: Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
I see.  

Thanks.


> 在 2015年12月24日,上午11:44,Zhan Zhang  写道:
> 
> There has to have a central point to collaboratively collecting exactly 1 
> records, currently the approach is using one single partitions, which is easy 
> to implement. 
> Otherwise, the driver has to count the number of records in each partition 
> and then decide how many records  to be materialized in each partition, 
> because some partition may not have enough number of records, sometimes it is 
> even empty.
> 
> I didn’t see any straightforward walk around for this.
> 
> Thanks.
> 
> Zhan Zhang
> 
> 
> 
> On Dec 23, 2015, at 5:32 PM, 汪洋  > wrote:
> 
>> It is an application running as an http server. So I collect the data as the 
>> response.
>> 
>>> 在 2015年12月24日,上午8:22,Hudong Wang >> > 写道:
>>> 
>>> When you call collect() it will bring all the data to the driver. Do you 
>>> mean to call persist() instead?
>>> 
>>> From: tiandiwo...@icloud.com 
>>> Subject: Problem using limit clause in spark sql
>>> Date: Wed, 23 Dec 2015 21:26:51 +0800
>>> To: user@spark.apache.org 
>>> 
>>> Hi,
>>> I am using spark sql in a way like this:
>>> 
>>> sqlContext.sql(“select * from table limit 1”).map(...).collect()
>>> 
>>> The problem is that the limit clause will collect all the 10,000 records 
>>> into a single partition, resulting the map afterwards running only in one 
>>> partition and being really slow.I tried to use repartition, but it is kind 
>>> of a waste to collect all those records into one partition and then shuffle 
>>> them around and then collect them again.
>>> 
>>> Is there a way to work around this? 
>>> BTW, there is no order by clause and I do not care which 1 records I 
>>> get as long as the total number is less or equal then 1.
>> 
> 



Re: Problem using limit clause in spark sql

2015-12-23 Thread Gaurav Agarwal
I am going to have the above scenario without using the limit clause; will
it then work by checking among all the partitions?
On Dec 24, 2015 9:26 AM, "汪洋"  wrote:

> I see.
>
> Thanks.
>
>
> 在 2015年12月24日,上午11:44,Zhan Zhang  写道:
>
> There has to have a central point to collaboratively collecting exactly
> 1 records, currently the approach is using one single partitions, which
> is easy to implement.
> Otherwise, the driver has to count the number of records in each partition
> and then decide how many records  to be materialized in each partition,
> because some partition may not have enough number of records, sometimes it
> is even empty.
>
> I didn’t see any straightforward walk around for this.
>
> Thanks.
>
> Zhan Zhang
>
>
>
> On Dec 23, 2015, at 5:32 PM, 汪洋  wrote:
>
> It is an application running as an http server. So I collect the data as
> the response.
>
> 在 2015年12月24日,上午8:22,Hudong Wang  写道:
>
> When you call collect() it will bring all the data to the driver. Do you
> mean to call persist() instead?
>
> --
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
>
> Hi,
> I am using spark sql in a way like this:
>
> sqlContext.sql(“select * from table limit 1”).map(...).collect()
>
> The problem is that the limit clause will collect all the 10,000 records
> into a single partition, resulting the map afterwards running only in one
> partition and being really slow.I tried to use repartition, but it is
> kind of a waste to collect all those records into one partition and then
> shuffle them around and then collect them again.
>
> Is there a way to work around this?
> BTW, there is no order by clause and I do not care which 1 records I
> get as long as the total number is less or equal then 1.
>
>
>
>
>


Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread ????????????
Hi Hokam,


Thank you very much. Your approach really works after I set the hostname/IP mapping in
the Windows hosts file. However, new error information comes out. I think it's
quite common, as I have seen this kind of message in many places.
Here's part of the output from the Eclipse console.


15/12/24 11:59:08 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151224105757-0002/92 on hostPort 10.20.17.74:44097 with 4 cores, 1024.0 
MB RAM
15/12/24 11:59:08 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/92 is now LOADING
15/12/24 11:59:08 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/92 is now RUNNING
15/12/24 11:59:12 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 11:59:27 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 11:59:42 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 11:59:57 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 12:00:12 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 12:00:27 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 12:00:42 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 12:00:57 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources
15/12/24 12:01:08 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/90 is now EXITED (Command exited with code 1)
15/12/24 12:01:08 INFO SparkDeploySchedulerBackend: Executor 
app-20151224105757-0002/90 removed: Command exited with code 1
15/12/24 12:01:08 INFO SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 90
15/12/24 12:01:08 INFO AppClient$ClientEndpoint: Executor added: 
app-20151224105757-0002/93 on worker-20151221140040-10.20.17.76-33817 
(10.20.17.76:33817) with 4 cores
15/12/24 12:01:08 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151224105757-0002/93 on hostPort 10.20.17.76:33817 with 4 cores, 1024.0 
MB RAM
15/12/24 12:01:08 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/93 is now LOADING
15/12/24 12:01:08 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/93 is now RUNNING
15/12/24 12:01:09 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/91 is now EXITED (Command exited with code 1)
15/12/24 12:01:09 INFO SparkDeploySchedulerBackend: Executor 
app-20151224105757-0002/91 removed: Command exited with code 1
15/12/24 12:01:09 INFO SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 91
15/12/24 12:01:09 INFO AppClient$ClientEndpoint: Executor added: 
app-20151224105757-0002/94 on worker-20151221140040-10.20.17.75-47807 
(10.20.17.75:47807) with 4 cores
15/12/24 12:01:09 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151224105757-0002/94 on hostPort 10.20.17.75:47807 with 4 cores, 1024.0 
MB RAM
15/12/24 12:01:09 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/94 is now LOADING
15/12/24 12:01:09 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/94 is now RUNNING
15/12/24 12:01:10 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/92 is now EXITED (Command exited with code 1)
15/12/24 12:01:10 INFO SparkDeploySchedulerBackend: Executor 
app-20151224105757-0002/92 removed: Command exited with code 1
15/12/24 12:01:10 INFO SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 92
15/12/24 12:01:10 INFO AppClient$ClientEndpoint: Executor added: 
app-20151224105757-0002/95 on worker-20151221193318-10.20.17.74-44097 
(10.20.17.74:44097) with 4 cores
15/12/24 12:01:10 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20151224105757-0002/95 on hostPort 10.20.17.74:44097 with 4 cores, 1024.0 
MB RAM
15/12/24 12:01:10 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/95 is now LOADING
15/12/24 12:01:10 INFO AppClient$ClientEndpoint: Executor updated: 
app-20151224105757-0002/95 is now RUNNING
15/12/24 12:01:12 WARN TaskSchedulerImpl: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources

Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Yanbo Liang
If you want to use the CSV format, please refer to the spark-csv project and its
examples.
https://github.com/databricks/spark-csv


2015-12-24 4:40 GMT+08:00 Zhan Zhang :

> Now json, parquet, orc(in hivecontext), text are natively supported. If
> you use avro or others, you have to include the package, which are not
> built into spark jar.
>
> Thanks.
>
> Zhan Zhang
>
> On Dec 23, 2015, at 8:57 AM, Christopher Brady <
> christopher.br...@oracle.com> wrote:
>
> DataFrameWriter.format
>
>
>
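
A short sketch of how the spark-csv data source is used once the package is on the classpath (for example via --packages com.databricks:spark-csv_2.10:1.3.0); the version, the DataFrame df, the SQLContext and the paths below are assumptions:

// Assumes an existing DataFrame `df` and SQLContext `sqlContext`.
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")      // emit a header row
  .save("/tmp/output-csv")       // placeholder path

val csvDf = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/tmp/output-csv")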


Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
Both are similar, give both a go and choose the one you like.
On Dec 23, 2015 7:55 PM, "Eran Witkon"  wrote:

> Thanks, so based on that article, should I use sbt or maven? Or either?
> Eran
> On Wed, 23 Dec 2015 at 13:05 Akhil Das  wrote:
>
>> You will have to point to your spark-assembly.jar since spark has a lot
>> of dependencies. You can read the answers discussed over here to have a
>> better understanding
>> http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon 
>> wrote:
>>
>>> Thanks, all of these examples shows how to link to spark source and
>>> build it as part of my project. why should I do that? why not point
>>> directly to my spark.jar?
>>> Am I missing something?
>>> Eran
>>>
>>> On Wed, Dec 23, 2015 at 9:59 AM Akhil Das 
>>> wrote:
>>>
 1. Install sbt plugin on IntelliJ
 2. Create a new project/Import an sbt project like Dean suggested
 3. Happy Debugging.

 You can also refer to this article for more information
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ

 Thanks
 Best Regards

 On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon 
 wrote:

> Any pointers how to use InteliJ for spark development?
> Any way to use scala worksheet run like spark- shell?
>


>>


Re: Missing dependencies when submitting scala app

2015-12-23 Thread Daniel Valdivia
Hi Jeff,

The problem was that I was pulling in json4s 3.3.0, which seems to have a compatibility problem
with Spark. I switched to 3.2.11 and everything is fine now.

On Tue, Dec 22, 2015 at 5:36 PM, Jeff Zhang  wrote:

> It might be jar conflict issue. Spark has dependency org.json4s.jackson,
> do you also specify org.json4s.jackson in your sbt dependency but with a
> different version ?
>
> On Wed, Dec 23, 2015 at 6:15 AM, Daniel Valdivia 
> wrote:
>
>> Hi,
>>
>> I'm trying to figure out how to bundle dependendies with a scala
>> application, so far my code was tested successfully on the spark-shell
>> however now that I'm trying to run it as a stand alone application which
>> I'm compilin with sbt is yielding me the error:
>>
>>
>> *java.lang.NoSuchMethodError:
>> org.json4s.jackson.JsonMethods$.parse$default$3()Z at
>> ClusterIncidents$$anonfun$1.apply(ClusterInciden*
>>
>> I'm doing "sbt clean package" and then spark-submit of the resulting jar,
>> however seems like either my driver or workers don't have the json4s
>> dependency, therefor can't find the parse method
>>
>> Any idea on how to solve this depdendency problem?
>>
>> thanks in advance
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
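
For anyone hitting the same conflict, the fix amounts to pinning json4s to the 3.2.x line that Spark 1.5 itself depends on; an sbt fragment, assuming json4s-jackson is the module in use:

// Match the json4s version bundled with Spark 1.5.x instead of pulling in 3.3.0.
libraryDependencies += "org.json4s" %% "json4s-jackson" % "3.2.11"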


Re: Problem using limit clause in spark sql

2015-12-23 Thread 汪洋
It is an application running as an http server. So I collect the data as the 
response.

> 在 2015年12月24日,上午8:22,Hudong Wang  写道:
> 
> When you call collect() it will bring all the data to the driver. Do you mean 
> to call persist() instead?
> 
> From: tiandiwo...@icloud.com
> Subject: Problem using limit clause in spark sql
> Date: Wed, 23 Dec 2015 21:26:51 +0800
> To: user@spark.apache.org
> 
> Hi,
> I am using spark sql in a way like this:
> 
> sqlContext.sql(“select * from table limit 1”).map(...).collect()
> 
> The problem is that the limit clause will collect all the 10,000 records into 
> a single partition, resulting the map afterwards running only in one 
> partition and being really slow.I tried to use repartition, but it is kind of 
> a waste to collect all those records into one partition and then shuffle them 
> around and then collect them again.
> 
> Is there a way to work around this? 
> BTW, there is no order by clause and I do not care which 1 records I get 
> as long as the total number is less or equal then 1.



Re: driver OOM due to io.netty.buffer items not getting finalized

2015-12-23 Thread Antony Mayi
fyi after further troubleshooting logging this as 
https://issues.apache.org/jira/browse/SPARK-12511 

On Tuesday, 22 December 2015, 18:16, Antony Mayi  
wrote:
 
 

I narrowed it down to the problem described, for example, here:
https://bugs.openjdk.java.net/browse/JDK-6293787
It is the mass finalization of zip Inflater/Deflater objects, which can't keep
up with the rate at which these instances are being garbage collected. As the JDK bug
report (not accepted as a bug) suggests, this is an error of suboptimal
destruction of the instances.
Not sure where the zip comes from; for all the compressors used in Spark I was
using the default snappy codec.
I am trying to disable all the spark.*.compress options, and so far it seems
this has dramatically improved things: the finalization looks to be keeping up and the
heap is stable.
Any input is still welcome!

On Tuesday, 22 December 2015, 12:17, Ted Yu  wrote:
 
 

 This might be related but the jmap output there looks different:
http://stackoverflow.com/questions/32537965/huge-number-of-io-netty-buffer-poolthreadcachememoryregioncacheentry-instances

On Tue, Dec 22, 2015 at 2:59 AM, Antony Mayi  
wrote:

I have streaming app (pyspark 1.5.2 on yarn) that's crashing due to driver (jvm 
part, not python) OOM (no matter how big heap is assigned, eventually runs out).
When checking the heap it is all taken by "byte" items of 
io.netty.buffer.PoolThreadCache. The number of 
io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry is constant yet the 
number of [B "bytes" keeps growing as well as the number of Finalizer 
instances. When checking the Finalizer instances it is all of 
ZipFile$ZipFileInputStream and ZipFile$ZipFileInflaterInputStream
 num     #instances         #bytes  class name
----------------------------------------------
   1:        123556      278723776  [B
   2:        258988       10359520  java.lang.ref.Finalizer
   3:        174620        9778720  java.util.zip.Deflater
   4:         66684        7468608  org.apache.spark.executor.TaskMetrics
   5:         80070        7160112  [C
   6:        282624        6782976  io.netty.buffer.PoolThreadCache$MemoryRegionCache$Entry
   7:        206371        4952904  java.lang.Long
the platform is using netty 3.6.6 and openjdk 1.8 (tried on 1.7 as well with 
same issue).
would anyone have a clue how to troubleshoot further?
thx.



 
   

 
  

Re: How to Parse & flatten JSON object in a text file using Spark & Scala into Dataframe

2015-12-23 Thread Gokula Krishnan D
You can try this, but I slightly modified the input structure, since the first
two columns were not in JSON format.

[image: Inline image 1]

Thanks & Regards,
Gokula Krishnan* (Gokul)*

On Wed, Dec 23, 2015 at 9:46 AM, Eran Witkon  wrote:

> Did you get a solution for this?
>
> On Tue, 22 Dec 2015 at 20:24 raja kbv  wrote:
>
>> Hi,
>>
>> I am new to spark.
>>
>> I have a text file with below structure.
>>
>>
>> (employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName,
>> Description, Duriation, Role}]})
>> Eg:
>> (123456, Employee1, {“ProjectDetails”:[
>>  {
>> “ProjectName”: “Web Develoement”, “Description” : “Online Sales website”,
>> “Duration” : “6 Months” , “Role” : “Developer”}
>>  {
>> “ProjectName”: “Spark Develoement”, “Description” : “Online Sales
>> Analysis”, “Duration” : “6 Months” , “Role” : “Data Engineer”}
>>  {
>> “ProjectName”: “Scala Training”, “Description” : “Training”, “Duration” :
>> “1 Month” }
>>   ]
>> }
>>
>>
>> Could someone help me to parse & flatten the record as below dataframe
>> using scala?
>>
>> employeeID,Name, ProjectName, Description, Duration, Role
>> 123456, Employee1, Web Develoement, Online Sales website, 6 Months ,
>> Developer
>> 123456, Employee1, Spark Develoement, Online Sales Analysis, 6 Months,
>> Data Engineer
>> 123456, Employee1, Scala Training, Training, 1 Month, null
>>
>>
>> Thank you in advance.
>>
>> Regards,
>> Raja
>>
>
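
The inline image with the actual code does not survive in the archive, so here is a rough Scala sketch of one way to handle the original layout directly. The file path, the exact splitting of the two leading non-JSON columns and the re-quoting of the JSON are assumptions about the real file:

import org.apache.spark.sql.functions.explode

// Assumed layout per line: (employeeID, Name, {"ProjectDetails":[ ... ]})
// Rewrap each line as a single JSON object so read.json can infer the whole schema.
val raw = sc.textFile("employees.txt")   // placeholder input path

val asJson = raw.map { line =>
  val body = line.trim.stripPrefix("(").stripSuffix(")")
  val Array(id, name, details) = body.split(",", 3)
  val inner = details.trim.stripPrefix("{").stripSuffix("}")
  s"""{"employeeID": ${id.trim}, "Name": "${name.trim}", $inner}"""
}

val df = sqlContext.read.json(asJson)

// One output row per project; missing fields (e.g. Role) come back as null.
val flat = df
  .select(df("employeeID"), df("Name"), explode(df("ProjectDetails")).as("p"))
  .select("employeeID", "Name", "p.ProjectName", "p.Description", "p.Duration", "p.Role")

flat.show()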


Re: Unable to create hive table using HiveContext

2015-12-23 Thread Zhan Zhang
You are using embedded mode, which will create the db locally (in your case,
maybe the db has already been created, but you do not have the right permissions?).

To connect to a remote metastore, hive-site.xml has to be correctly configured.

Thanks.

Zhan Zhang


On Dec 23, 2015, at 7:24 AM, Soni spark 
> wrote:

Hi friends,

I am trying to create hive table through spark with Java code in Eclipse using 
below code.

HiveContext sqlContext = new 
org.apache.spark.sql.hive.HiveContext(sc.sc());
   sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)");


but i am getting error

RROR XBM0J: Directory /home/workspace4/Test/metastore_db already exists.

I am not sure why metastore creating in workspace. Please help me.

Thanks
Soniya
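
A small Scala sketch of the second option Zhan mentions: pointing the HiveContext at a remote metastore instead of the embedded Derby one. The thrift URI is a placeholder, and the more reliable route is a hive-site.xml containing this property on the driver's classpath, since whether a programmatic setConf is picked up before the first metastore access can depend on the Spark version:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // sc: an existing SparkContext
// Placeholder URI -- point this at your Hive metastore service.
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")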



Re: rdd split into new rdd

2015-12-23 Thread Stéphane Verlet
I use Scala, but I guess in Java the code would look like this (key and value
types taken from your example record):

JavaPairRDD<String, TreeMap<String, Integer>> rdd = ...

JavaPairRDD<String, List<String>> rdd2 = rdd.mapPartitionsToPair(function,
true);

where function implements
   PairFlatMapFunction<Iterator<Tuple2<String, TreeMap<String, Integer>>>, String,
List<String>>

Iterable<Tuple2<String, List<String>>>
call(java.util.Iterator<Tuple2<String, TreeMap<String, Integer>>> ite) {

// Iterate over your treemaps and generate your lists

}






On Wed, Dec 23, 2015 at 10:49 AM, Yasemin Kaya  wrote:

> How can i use mapPartion? Could u give me an example?
>
> 2015-12-23 17:26 GMT+02:00 Stéphane Verlet :
>
>> You should be able to do that using mapPartition
>>
>> On Wed, Dec 23, 2015 at 8:24 AM, Ted Yu  wrote:
>>
>>> bq. {a=1, b=1, c=2, d=2}
>>>
>>> Can you elaborate your criteria a bit more ? The above seems to be a
>>> Set, not a Map.
>>>
>>> Cheers
>>>
>>> On Wed, Dec 23, 2015 at 7:11 AM, Yasemin Kaya  wrote:
>>>
 Hi,

 I have data
 *JavaPairRDD> *format. In example:

 *(1610, {a=1, b=1, c=2, d=2}) *

 I want to get
 *JavaPairRDD* In example:


 *(1610, {a, b})*
 *(1610, {c, d})*

 Is there a way to solve this problem?

 Best,
 yasemin
 --
 hiç ender hiç

>>>
>>>
>>
>
>
> --
> hiç ender hiç
>


Re: DataFrameWriter.format(String) is there a list of options?

2015-12-23 Thread Zhan Zhang
Currently json, parquet, orc (in HiveContext) and text are natively supported. If you use
avro or other formats, you have to include the corresponding package, which is not built into the Spark
jar.

Thanks.

Zhan Zhang

On Dec 23, 2015, at 8:57 AM, Christopher Brady 
> wrote:

DataFrameWriter.format



RE: How to Parse & flatten JSON object in a text file using Spark into Dataframe

2015-12-23 Thread Bharathi Raja
Thanks Gokul, but the file I have has the same format as I mentioned:
the first two columns are not in JSON format.

Thanks,
Raja

-Original Message-
From: "Gokula Krishnan D" 
Sent: ‎12/‎24/‎2015 2:44 AM
To: "Eran Witkon" 
Cc: "raja kbv" ; "user@spark.apache.org" 

Subject: Re: How to Parse & flatten JSON object in a text file using Spark 
 into Dataframe

You can try this .. But slightly modified the  input structure since first two 
columns were not in Json format. 






Thanks & Regards, 
Gokula Krishnan (Gokul)


On Wed, Dec 23, 2015 at 9:46 AM, Eran Witkon  wrote:

Did you get a solution for this?


On Tue, 22 Dec 2015 at 20:24 raja kbv  wrote:

Hi,


I am new to spark.


I have a text file with below structure.


 
(employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, 
Description, Duriation, Role}]})
Eg:
(123456, Employee1, {“ProjectDetails”:[
 { “ProjectName”: “Web 
Develoement”, “Description” : “Online Sales website”, “Duration” : “6 Months” , 
“Role” : “Developer”}
 { “ProjectName”: 
“Spark Develoement”, “Description” : “Online Sales Analysis”, “Duration” : “6 
Months” , “Role” : “Data Engineer”}
 { “ProjectName”: 
“Scala Training”, “Description” : “Training”, “Duration” : “1 Month” }
  ]
}
 
 
Could someone help me to parse & flatten the record as below dataframe using 
scala?
 
employeeID,Name, ProjectName, Description, Duration, Role
123456, Employee1, Web Develoement, Online Sales website, 6 Months , Developer
123456, Employee1, Spark Develoement, Online Sales Analysis, 6 Months, Data 
Engineer
123456, Employee1, Scala Training, Training, 1 Month, null
 


Thank you in advance.


Regards,
Raja
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

RE: How to Parse & flatten JSON object in a text file using Spark into Dataframe

2015-12-23 Thread Bharathi Raja
Hi Eran, I didn't get the solution yet. 

Thanks,
Raja

-Original Message-
From: "Eran Witkon" 
Sent: ‎12/‎23/‎2015 8:17 PM
To: "raja kbv" ; "user@spark.apache.org" 

Subject: Re: How to Parse & flatten JSON object in a text file using Spark 
 into Dataframe

Did you get a solution for this?

On Tue, 22 Dec 2015 at 20:24 raja kbv  wrote:

Hi,


I am new to spark.


I have a text file with below structure.


 
(employeeID: Int, Name: String, ProjectDetails: JsonObject{[{ProjectName, 
Description, Duriation, Role}]})
Eg:
(123456, Employee1, {“ProjectDetails”:[
 { “ProjectName”: “Web 
Develoement”, “Description” : “Online Sales website”, “Duration” : “6 Months” , 
“Role” : “Developer”}
 { “ProjectName”: 
“Spark Develoement”, “Description” : “Online Sales Analysis”, “Duration” : “6 
Months” , “Role” : “Data Engineer”}
 { “ProjectName”: 
“Scala Training”, “Description” : “Training”, “Duration” : “1 Month” }
  ]
}
 
 
Could someone help me to parse & flatten the record as below dataframe using 
scala?
 
employeeID,Name, ProjectName, Description, Duration, Role
123456, Employee1, Web Develoement, Online Sales website, 6 Months , Developer
123456, Employee1, Spark Develoement, Online Sales Analysis, 6 Months, Data 
Engineer
123456, Employee1, Scala Training, Training, 1 Month, null
 


Thank you in advance.


Regards,
Raja

Re: error creating custom schema

2015-12-23 Thread Ted Yu
Looks like a comma was missing after "C1"

Cheers

> On Dec 23, 2015, at 1:47 AM, Divya Gehlot  wrote:
> 
> Hi,
> I am trying to create custom schema but its throwing below error 
> 
> 
>> scala> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.hive.HiveContext
>> 
>> scala> import org.apache.spark.sql.hive.orc._
>> import org.apache.spark.sql.hive.orc._
>> 
>> scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> 15/12/23 04:42:09 WARN SparkConf: The configuration key 
>> 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 
>> and and may be removed in the future. Please use the new key 
>> 'spark.yarn.am.waitTime' instead.
>> 15/12/23 04:42:09 INFO HiveContext: Initializing execution hive, version 
>> 0.13.1
>> hiveContext: org.apache.spark.sql.hive.HiveContext = 
>> org.apache.spark.sql.hive.HiveContext@3ca50ddf
>> 
>> scala> import org.apache.spark.sql.types.{StructType, StructField, 
>> StringType, IntegerType,FloatType ,LongType ,TimestampType };
>> import org.apache.spark.sql.types.{StructType, StructField, StringType, 
>> IntegerType, FloatType, LongType, TimestampType}
>> 
>> scala> val loandepoSchema = StructType(Seq(
>>  | StructField("C1" StringType, true),
>>  | StructField("COLUMN2", StringType , true),
>>  | StructField("COLUMN3", StringType, true),
>>  | StructField("COLUMN4", StringType, true),
>>  | StructField("COLUMN5", StringType , true),
>>  | StructField("COLUMN6", StringType, true),
>>  | StructField("COLUMN7", StringType, true),
>>  | StructField("COLUMN8", StringType, true),
>>  | StructField("COLUMN9", StringType, true),
>>  | StructField("COLUMN10", StringType, true),
>>  | StructField("COLUMN11", StringType, true),
>>  | StructField("COLUMN12", StringType, true),
>>  | StructField("COLUMN13", StringType, true),
>>  | StructField("COLUMN14", StringType, true),
>>  | StructField("COLUMN15", StringType, true),
>>  | StructField("COLUMN16", StringType, true),
>>  | StructField("COLUMN17", StringType, true)
>>  | StructField("COLUMN18", StringType, true),
>>  | StructField("COLUMN19", StringType, true),
>>  | StructField("COLUMN20", StringType, true),
>>  | StructField("COLUMN21", StringType, true),
>>  | StructField("COLUMN22", StringType, true)))
>> :25: error: value StringType is not a member of String
>>StructField("C1" StringType, true),
>> ^
> 
> Would really appreciate the guidance/pointers.
> 
> Thanks,
> Divya 
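
A corrected sketch of the schema definition (only a few of the columns shown): a comma is needed between the field name and its type, and there is a second missing comma after the COLUMN17 line as well:

import org.apache.spark.sql.types.{StructField, StringType, StructType}

// Corrected sketch -- every StructField needs a comma after the column name,
// and consecutive StructFields need commas between them.
val loandepoSchema = StructType(Seq(
  StructField("C1", StringType, true),        // comma added after "C1"
  StructField("COLUMN2", StringType, true),
  StructField("COLUMN3", StringType, true),
  // ... remaining columns ...
  StructField("COLUMN17", StringType, true),  // comma added here too
  StructField("COLUMN18", StringType, true),
  StructField("COLUMN22", StringType, true)))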


Re: Using inteliJ for spark development

2015-12-23 Thread Eran Witkon
Thanks, all of these examples show how to link to the Spark source and build
it as part of my project. Why should I do that? Why not point directly to
my spark.jar?
Am I missing something?
Eran

On Wed, Dec 23, 2015 at 9:59 AM Akhil Das 
wrote:

> 1. Install sbt plugin on IntelliJ
> 2. Create a new project/Import an sbt project like Dean suggested
> 3. Happy Debugging.
>
> You can also refer to this article for more information
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>
> Thanks
> Best Regards
>
> On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon 
> wrote:
>
>> Any pointers how to use InteliJ for spark development?
>> Any way to use scala worksheet run like spark- shell?
>>
>
>


Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
You will have to point to your spark-assembly.jar since spark has a lot of
dependencies. You can read the answers discussed over here to have a better
understanding
http://stackoverflow.com/questions/3589562/why-maven-what-are-the-benefits

Thanks
Best Regards

On Wed, Dec 23, 2015 at 4:27 PM, Eran Witkon  wrote:

> Thanks, all of these examples shows how to link to spark source and build
> it as part of my project. why should I do that? why not point directly to
> my spark.jar?
> Am I missing something?
> Eran
>
> On Wed, Dec 23, 2015 at 9:59 AM Akhil Das 
> wrote:
>
>> 1. Install sbt plugin on IntelliJ
>> 2. Create a new project/Import an sbt project like Dean suggested
>> 3. Happy Debugging.
>>
>> You can also refer to this article for more information
>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon 
>> wrote:
>>
>>> Any pointers how to use InteliJ for spark development?
>>> Any way to use scala worksheet run like spark- shell?
>>>
>>
>>


Re: Problem with Spark Standalone

2015-12-23 Thread luca_guerra
This is the master's log file:

15/12/22 03:23:05 ERROR FileAppender: Error writing stream to file
/disco1/spark-1.5.1/work/app-20151222032252-0010/0/stderr
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/12/22 03:23:05 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkExecutor@:38095] has failed, address is
now gated for [5000] ms. Reason: [Disassociated] 

And this is the worker's log file:

15/12/22 03:23:05 ERROR FileAppender: Error writing stream to file
/disco1/spark-1.5.1/work/app-20151222032252-0010/2/stderr
java.io.IOException: Stream closed
at 
java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at
org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at
org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
at
org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
15/12/22 03:23:05 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkExecutor@:41370] has failed, address is
now gated for [5000] ms. Reason: [Disassociated] 

Thank you very much for the help.







Re: Streaming json records from kafka ... how can I process ... help please :)

2015-12-23 Thread Akhil
Akhil wrote
> You can do it like this:
> 
> lines.foreachRDD(jsonRDD =>{ 
>
>   val data = sqlContext.read.json(jsonRDD)
>   data.registerTempTable("mytable")
>   sqlContext.sql("SELECT * FROM mytable")
> 
>   })

See
http://spark.apache.org/docs/latest/streaming-programming-guide.html#dataframe-and-sql-operations
and
http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
for more information.
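
For completeness, a slightly fuller sketch of that pattern (Spark 1.5.x APIs;
the socket source below is only a placeholder for your Kafka stream, and the
app and table names are made up):

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object JsonStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonStreamExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source: in your case this would be the value stream of
    // your Kafka direct stream, i.e. a DStream[String] of JSON records.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { (jsonRDD: RDD[String]) =>
      // Reuse a singleton SQLContext across batches.
      val sqlContext = SQLContext.getOrCreate(jsonRDD.sparkContext)
      if (!jsonRDD.isEmpty()) {
        val data = sqlContext.read.json(jsonRDD) // schema is inferred from the JSON
        data.registerTempTable("mytable")
        sqlContext.sql("SELECT * FROM mytable").show()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}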



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-json-records-from-kafka-how-can-I-process-help-please-tp25769p25782.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using inteliJ for spark development

2015-12-23 Thread Akhil Das
1. Install the sbt plugin in IntelliJ.
2. Create a new project or import an sbt project as Dean suggested (a sample
build.sbt is sketched below).
3. Happy debugging.

You can also refer to this article for more information
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
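
If it helps, a minimal build.sbt for such a project might look roughly like
this (the project name is made up, and the versions match the 1.5.1/2.10.4
setup discussed on this list; adjust to your own cluster):

// build.sbt -- minimal sketch for a Spark 1.5.x / Scala 2.10 project
name := "spark-playground"            // made-up project name

version := "0.1.0-SNAPSHOT"

scalaVersion := "2.10.4"

// "provided" keeps the Spark jars out of an assembly built for spark-submit;
// remove it if you want to run/debug the job directly from IntelliJ.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.5.1" % "provided",
  "org.apache.spark" %% "spark-sql"       % "1.5.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "1.5.1" % "provided"
)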

Thanks
Best Regards

On Mon, Dec 21, 2015 at 10:51 PM, Eran Witkon  wrote:

> Any pointers on how to use IntelliJ for Spark development?
> Any way to use a Scala worksheet to run code like spark-shell?
>


Re: Streaming json records from kafka ... how can I process ... help please :)

2015-12-23 Thread Gideon
What you wrote is inaccurate.
When you create a direct Kafka stream, what you actually get back is a
DirectKafkaInputDStream. DirectKafkaInputDStream extends DStream, and two of
the methods a DStream provides are map and print. When you map over your
DirectKafkaInputDStream, what you actually get is a MappedDStream.
MappedDStream also extends DStream, which means you can invoke print on it.
DStreams are the Spark Streaming abstraction that lets you operate on the
RDDs in the stream.

Regarding converting the JSON strings, I'm not quite sure what you mean, but
you can easily transform the data using the different methods on the DStream
objects you get back (like your map example).

I hope that was clear.
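
A short sketch of the above, in case it helps (Spark 1.5.x Kafka direct API;
the broker and topic names are placeholders, and spark-streaming-kafka needs
to be on the classpath):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamExample")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // placeholder broker
    val topics = Set("mytopic")                                       // placeholder topic

    // Under the hood this is a DirectKafkaInputDStream of (key, value) pairs.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // map gives back a MappedDStream; both are DStreams, so print is available.
    val values = stream.map { case (_, value) => value }
    values.print()

    ssc.start()
    ssc.awaitTermination()
  }
}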



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-json-records-from-kafka-how-can-I-process-help-please-tp25769p25780.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: I coded an example to use Twitter stream as a data source for Spark

2015-12-23 Thread Amir Rahnama
That's my goal, brother.

But let's agree Spark is not a very straightforward repo to get yourself
started with.

I have got some initial code though.
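
For reference, the custom-receiver route Akhil suggests looks roughly like
this in Scala (only a skeleton; the actual Twitter client is left out):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Skeleton custom receiver: replace the body of receive() with a real Twitter
// client and call store() for every tweet it delivers.
class TweetReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    // Collect data on a separate thread so onStart() returns quickly.
    new Thread("Tweet Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  override def onStop(): Unit = {
    // A real receiver would close its client/connection here.
  }

  private def receive(): Unit = {
    while (!isStopped()) {
      // Placeholder data; a real implementation would read from Twitter here.
      store("dummy tweet " + System.currentTimeMillis())
      Thread.sleep(1000)
    }
  }
}

// Usage: ssc.receiverStream(new TweetReceiver()) yields a DStream[String].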

On Wednesday, 23 December 2015, Akhil Das 
wrote:

> Why not create a custom dstream
>  and
> generate the data from there itself instead of spark connecting to a socket
> server which will be fed by another twitter client?
>
> Thanks
> Best Regards
>
> On Sat, Dec 19, 2015 at 5:47 PM, Amir Rahnama  > wrote:
>
>> Hi guys,
>>
>> Thought someone would need this:
>>
>> https://github.com/ambodi/realtime-spark-twitter-stream-mining
>>
>> you can use this approach to feed twitter stream to your spark job. So
>> far, PySpark does not have a twitter dstream source.
>>
>>
>>
>> --
>> Thanks and Regards,
>>
>> Amir Hossein Rahnama
>>
>> *Tel: +46 (0) 761 681 102*
>> Website: www.ambodi.com
>> Twitter: @_ambodi 
>>
>
>

-- 
Thanks and Regards,

Amir Hossein Rahnama

*Tel: +46 (0) 761 681 102*
Website: www.ambodi.com
Twitter: @_ambodi 


Re: Problem of submitting Spark task to cluster from eclipse IDE on Windows

2015-12-23 Thread Akhil Das
You need to:

1. Make sure your local router has NAT enabled and has port-forwarded the
networking ports listed here.
2. Make sure port 7077 on your cluster is reachable from your local (public)
IP address. You can check with: telnet 10.20.17.70 7077
3. Set spark.driver.host so that the cluster can connect back to your
machine (see the sketch below).
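
For example, in Scala (just a sketch: the master address is taken from your
log, 10.20.6.23 is your Windows machine, and the fixed driver port is made up
so you know which port to open):

import org.apache.spark.{SparkConf, SparkContext}

// Point the app at the standalone master and tell the executors how to reach
// back to the Windows machine running the driver.
val conf = new SparkConf()
  .setAppName("WordCount")
  .setMaster("spark://10.20.17.70:7077")
  .set("spark.driver.host", "10.20.6.23") // address reachable from the cluster
  .set("spark.driver.port", "51000")      // made-up fixed port, open it in the firewall

val sc = new SparkContext(conf)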



Thanks
Best Regards

On Wed, Dec 23, 2015 at 10:02 AM, superbee84  wrote:

> Hi All,
>
>    I'm new to Spark. Before I describe the problem, I'd like to let you
> know the roles of the machines that make up the cluster and the purpose of
> my work. By reading and following the instructions and tutorials, I
> successfully built up a cluster of 7 CentOS 6.5 machines. I installed Hadoop
> 2.7.1, Spark 1.5.1, Scala 2.10.4 and ZooKeeper 3.4.5 on them. The details
> are listed below:
>
>
> Host Name | IP Address  | Hadoop 2.7.1             | Spark 1.5.1     | ZooKeeper
> hadoop00  | 10.20.17.70 | NameNode(Active)         | Master(Active)  | none
> hadoop01  | 10.20.17.71 | NameNode(Standby)        | Master(Standby) | none
> hadoop02  | 10.20.17.72 | ResourceManager(Active)  | none            | none
> hadoop03  | 10.20.17.73 | ResourceManager(Standby) | none            | none
> hadoop04  | 10.20.17.74 | DataNode                 | Worker          | JournalNode
> hadoop05  | 10.20.17.75 | DataNode                 | Worker          | JournalNode
> hadoop06  | 10.20.17.76 | DataNode                 | Worker          | JournalNode
>
>    Now my *purpose* is to develop Hadoop/Spark applications on my own
> computer (IP: 10.20.6.23) and submit them to the remote cluster. As all the
> other guys in our group are used to Eclipse on Windows, I'm trying to work
> that way. I have successfully submitted the WordCount MapReduce job to YARN
> from Eclipse on Windows and it ran smoothly. But when I tried to run the
> Spark WordCount, it gave me the following error in the Eclipse console:
>
> 15/12/23 11:15:30 INFO AppClient$ClientEndpoint: Connecting to master
> spark://10.20.17.70:7077...
> 15/12/23 11:15:50 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in
> thread Thread[appclient-registration-retry-thread,5,main]
> java.util.concurrent.RejectedExecutionException: Task
> java.util.concurrent.FutureTask@29ed85e7 rejected from
> java.util.concurrent.ThreadPoolExecutor@28f21632[Running, pool size = 1,
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at
>
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
> at java.util.concurrent.AbstractExecutorService.submit(Unknown
> Source)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:95)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint.tryRegisterAllMasters(AppClient.scala:95)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint.org$apache$spark$deploy$client$AppClient$ClientEndpoint$$registerWithMaster(AppClient.scala:121)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2$$anonfun$run$1.apply$mcV$sp(AppClient.scala:132)
> at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:1119)
> at
>
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anon$2.run(AppClient.scala:124)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> at java.util.concurrent.FutureTask.runAndReset(Unknown Source)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Unknown
> Source)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
> Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
> Source)
> at java.lang.Thread.run(Unknown Source)
> 15/12/23 11:15:50 INFO DiskBlockManager: Shutdown hook called
> 15/12/23 11:15:50 INFO ShutdownHookManager: Shutdown hook called
>
> Then