Checkpointing and accessing the checkpoint data

2019-06-27 Thread Jean-Georges Perrin
Hi Sparkians,

Few questions around checkpointing.

1. Checkpointing “dump” file / persisting to disk
Is the file encrypted or is it a standard parquet file? 

2. If the file is not encrypted, can I use it with another app? (I know it’s 
kind of a weird stretch case.)

3. Have you seen, or do you know of, any performance comparison between 
checkpointing and caching? On small datasets, caching seems more performant, 
but I can imagine that there is a sweet spot…

Thanks!

jgp



Jean-Georges Perrin
j...@jgp.net






Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-15 Thread Jean-Georges Perrin
I see… Did you consider Structured Streaming?

Otherwise, you could create a factory that builds your higher-level object and 
returns an interface defining your API; the implementation can then vary based 
on the context.
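A minimal sketch (in Java) of that factory/interface idea, aimed at the JavaDStream question in the thread below; all class and method names here are hypothetical, not from the Waterdrop project:

import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.dstream.DStream;
import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;

// Hypothetical API interface: callers only see this, whatever language built the stream.
interface StreamSource<T> {
  JavaDStream<T> getJavaDStream();
}

// Hypothetical factory: wraps the Scala DStream[T] coming out of getDStream()
// into a JavaDStream<T>, so Java callers never touch the Scala type.
final class StreamSourceFactory {
  static <T> StreamSource<T> wrap(DStream<T> scalaStream, Class<T> clazz) {
    ClassTag<T> tag = ClassTag$.MODULE$.apply(clazz);
    return () -> new JavaDStream<>(scalaStream, tag);
  }
}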

And English is not my native language either...

Jean-Georges Perrin
j...@jgp.net




> On May 14, 2019, at 21:47, Gary Gao  wrote:
> 
> Thanks for reply, Jean
>   In my project, I'm working on a higher abstraction layer of Spark 
> streaming to build a data processing product, and I am trying to provide a common 
> API for Java and Scala developers.
>   You can see the abstract class defined here: 
> https://github.com/InterestingLab/waterdrop/blob/master/waterdrop-apis/src/main/scala/io/github/interestinglab/waterdrop/apis/BaseStreamingInput.scala
>  
> 
>There is a method, getDStream, that returns a DStream[T]. Currently a 
> Scala class can extend this class and override getDStream, but I also want a 
> Java class to be able to extend this class and return a JavaDStream.
> This is my real problem. 
>  Tell me if the above description is not clear, because English is 
> not my native language. 
> 
> Thanks in advance
> Gary
> 
> On Tue, May 14, 2019 at 11:06 PM Jean Georges Perrin  <mailto:j...@jgp.net>> wrote:
> There are a little bit more than the list you specified. Nevertheless, some 
> data types are not directly compatible between Scala and Java and require 
> conversion, so it’s good not to pollute your code with plenty of conversions 
> and to focus on using the straight API. 
> 
> I don’t remember off the top of my head, but if you use more Spark 2 
> features (dataframes, structured streaming...) you will require fewer of those 
> Java-specific APIs. 
> 
> Do you see a problem here? What’s your take on this?
> 
> jg
> 
> 
> On May 14, 2019, at 10:22, Gary Gao  <mailto:garygaow...@gmail.com>> wrote:
> 
>> Hi all,
>> 
>> I am wondering why do we need Java-Friendly APIs in Spark ? Why can't we 
>> just use scala apis in java codes ? What's the difference ?
>> 
>> Some examples of Java-Friendly APIs commented in Spark code are as follows:
>> 
>>   JavaDStream
>>   JavaInputDStream
>>   JavaStreamingContext
>>   JavaSparkContext
>> 



Re: Why do we need Java-Friendly APIs in Spark ?

2019-05-14 Thread Jean Georges Perrin
There are a little bit more than the list you specified. Nevertheless, some 
data types are not directly compatible between Scala and Java and require 
conversion, so it’s good not to pollute your code with plenty of conversions and 
to focus on using the straight API. 

I don’t remember off the top of my head, but if you use more Spark 2 features 
(dataframes, structured streaming...) you will require fewer of those 
Java-specific APIs. 
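For example, a minimal sketch (in Java) of what relying on the Spark 2 API looks like: with Dataset<Row> there is no Java-specific wrapper class involved (the file path is just a placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataframeFromJava {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Dataframe from Java")
        .master("local[*]")
        .getOrCreate();

    // Dataset<Row> is the same class Scala uses: no JavaSparkContext/JavaRDD needed.
    Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("data/sample.csv"); // hypothetical path
    df.printSchema();
    df.show(5);

    spark.stop();
  }
}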

Do you see a problem here? What’s your take on this?

jg


> On May 14, 2019, at 10:22, Gary Gao  wrote:
> 
> Hi all,
> 
> I am wondering why do we need Java-Friendly APIs in Spark ? Why can't we just 
> use scala apis in java codes ? What's the difference ?
> 
> Some examples of Java-Friendly APIs commented in Spark code are as follows:
> 
>   JavaDStream
>   JavaInputDStream
>   JavaStreamingContext
>   JavaSparkContext
> 


Multiple sessions in one application?

2018-12-19 Thread Jean Georges Perrin
Hi there,

I was curious about what use cases would drive the use of newSession() (as in 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SparkSession.html#newSession--).

I understand that you get a cleaner slate, but why would you need it?

Thanks,

jg




Re: how to generate a large dataset in parallel

2018-12-13 Thread Jean Georges Perrin
Do you just want to generate some data in Spark, or to ingest a large dataset 
from outside of Spark? What’s the ultimate goal you’re pursuing?

jg


> On Dec 13, 2018, at 21:38, lk_spark  wrote:
> 
> hi, all:
> I want to generate some test data, which contains about one hundred 
> million rows.
> I created a dataset with ten rows, and I do a df.union operation in a 'for' 
> loop, but this causes the operation to only happen on the driver node.
> How can I do it on the whole cluster?
>  
> 2018-12-14
> lk_spark
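Not an answer to the "why" question, but a minimal sketch (in Java) of one way to generate a large dataset on the whole cluster instead of looping union on the driver; the row count, columns, and output path are placeholders:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GenerateTestData {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Generate test data")
        .getOrCreate();

    // spark.range() is distributed across the executors; no driver-side loop needed.
    Dataset<Row> df = spark.range(100_000_000L)
        .withColumn("value", rand())             // hypothetical columns
        .withColumn("bucket", col("id").mod(10));

    df.write().mode("overwrite").parquet("/tmp/test-data"); // hypothetical output path
    spark.stop();
  }
}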


Re: OData compliant API for Spark

2018-12-05 Thread Jean Georges Perrin
I was involved in a project like that and we decided to deploy the data in 
https://ckan.org/. We used Spark for the data pipeline and transformation. HIH (hope it helps). 

jg


> On Dec 4, 2018, at 21:14, Affan Syed  wrote:
> 
> All,
> 
> We have been thinking about exposing our platform for analytics as an OData 
> server (for its ease of compliance with 3rd-party BI tools like Tableau, etc.) 
> -- so Livy is not in the picture right now.  
> 
> Has there been any effort in this regard? Is there any interest, or has there 
> been any discussion that someone can point towards? 
> 
> We want to expose this connection over API so the JDBC->thriftserver->Spark 
> route is not being considered right now. 
> 
> 
> - Affan
> ᐧ


Re: Is there any Spark source in Java

2018-11-03 Thread Jean Georges Perrin
I take this one very close to my heart :) 

Look at:
https://github.com/jgperrin/net.jgp.labs.spark

And if the examples are too weird, have a look at:
http://jgp.net/book published by Manning 

Feedback appreciated!

jg


> On Nov 3, 2018, at 12:30, Jeyhun Karimov  wrote:
> 
> Hi Soheil,
> 
> From the spark github repo, you can find some classes implemented in Java:
> 
> https://github.com/apache/spark/search?l=java
> 
> Cheers,
> Jeyhun
> 
>> On Sat, Nov 3, 2018 at 6:42 PM Soheil Pourbafrani  
>> wrote:
>> Hi, I want to customize some part of Spark. I was wondering if any of the 
>> Spark source is written in the Java language, or if all the sources are in 
>> Scala?


Re: java vs scala for Apache Spark - is there a performance difference ?

2018-10-29 Thread Jean Georges Perrin
I did not see anything, but I am curious if you find something.

I think one of the big benefits of using Java, for data engineering in the 
context of Spark, is that you do not have to train a lot of your team in 
Scala. Now, if you want to do data science, Java is probably not the best tool 
yet...

> On Oct 26, 2018, at 6:04 PM, karan alang  wrote:
> 
> Hello 
> - is there a "performance" difference when using Java or Scala for Apache 
> Spark ?
> 
> I understand, there are other obvious differences (less code with scala, 
> easier to focus on logic etc), 
> but wrt performance - i think there would not be much of a difference since 
> both of them are JVM based, 
> pls. let me know if this is not the case.
> 
> thanks!



Triangle Apache Spark Meetup

2018-10-10 Thread Jean Georges Perrin
Hi,


Just a small plug for the Triangle Apache Spark Meetup (TASM), which covers Raleigh, 
Durham, and Chapel Hill in North Carolina, USA. The group started back in July 
2015. More details here: https://www.meetup.com/Triangle-Apache-Spark-Meetup/.

Can you add our meetup to http://spark.apache.org/community.html?

jg




Where is the DAG stored before catalyst gets it?

2018-10-04 Thread Jean Georges Perrin
Hi, 

I am assuming it is still in the master and when catalyst is finished it sends 
the tasks to the workers.

Correct?

tia

jg
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to merge multiple rows

2018-08-22 Thread Jean Georges Perrin
How do you do it now? 

You could use a withColumn(“newDetails”, ) 

jg


> On Aug 22, 2018, at 16:04, msbreuer  wrote:
> 
> A dataframe with following contents is given:
> 
> ID PART DETAILS
> 1  1    A1
> 1  2    A2
> 1  3    A3
> 2  1    B1
> 3  1    C1
> 
> Target format should be as following:
> 
> ID DETAILS
> 1 A1+A2+A3
> 2 B1
> 3 C1
> 
> Note, the order of A1-3 is important.
> 
> Currently I am using this alternative:
> 
> ID DETAIL_1 DETAIL_2 DETAIL_3
> 1 A1   A2   A3
> 2 B1
> 3 C1
> 
> What would be the best method to do such a transformation on a large dataset?
> 
> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
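For what it's worth, a minimal sketch (in Java) of one possible way to do the merge with groupBy and collect_list, assuming the column names from the example above; the ordering caveat in the comment matters for the A1+A2+A3 requirement:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MergeDetails {
  // df has the columns ID, PART, DETAILS from the example above.
  static Dataset<Row> merge(Dataset<Row> df) {
    return df
        // Note: ordering inside collect_list is not strictly guaranteed after the
        // shuffle; if PART must drive the order, collect structs and sort_array them.
        .orderBy("ID", "PART")
        .groupBy("ID")
        .agg(concat_ws("+", collect_list(col("DETAILS"))).alias("DETAILS"));
  }
}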


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: spark sql data skew

2018-07-13 Thread Jean Georges Perrin
Just thinking out loud… repartition by key? create a composite key based on 
company and userid? 

How big is your dataset?

> On Jul 13, 2018, at 06:20, 崔苗  wrote:
> 
> Hi,
> when I want to count(distinct userId) by company, I hit data skew and the 
> task takes too long. How do I count distinct by key on skewed data in Spark 
> SQL?
> 
> thanks for any reply
> 
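For what it's worth, a minimal sketch (in Java) of the composite-key idea: deduplicate on (company, userId) first, then do a plain count per company, so the distinct work is spread over the composite key instead of piling up on one hot company. The column names are assumed from the question:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class DistinctUsersPerCompany {
  // df has columns "company" and "userId" (assumed names).
  static Dataset<Row> countDistinctUsers(Dataset<Row> df) {
    // Stage 1: deduplicate on the composite (company, userId) key.
    Dataset<Row> deduplicated = df.select("company", "userId").distinct();
    // Stage 2: a plain count per company, much lighter than
    // count(distinct userId) shuffled onto a single hot key.
    return deduplicated.groupBy("company")
        .agg(count("userId").alias("distinct_users"));
  }
}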



Re: submitting dependencies

2018-06-27 Thread Jean Georges Perrin
Have you tried to build an uber jar to bundle all your classes together?

> On Jun 27, 2018, at 01:27, amin mohebbi  > wrote:
> 
> Could you please help me to understand how I should submit my spark 
> application ?
> 
> I have used this connector (pygmalios/reactiveinflux-spark 
> ) to connect spark with 
> influxdb. 
> 
> 
> 
> The code is working fine, but when I submit the jar file through spark-submit 
> I will face this error 
> 
> "Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/pygmalios/reactiveinflux/ReactiveInfluxDbName"
> 
> I submit like this : 
> spark-submit --jars 
> /home/sshuser/reactiveinflux-spark_2.10-1.4.0.10.0.5.1.jar sapn_2.11-1.0.jar
> 
> Can you help to solve this issue? 
> 
> 
> Best Regards ... Amin 
> Mohebbi PhD candidate in Software Engineering   at university of Malaysia   
> Tel : +60 18 2040 017 E-Mail : tp025...@ex.apiit.edu.my 
>    amin_...@me.com 
> 


A code example of Catalyst optimization

2018-06-04 Thread Jean Georges Perrin
Hi there,

I am looking for an example of optimization through Catalyst, that you can 
demonstrate via code. Typically, you load some data in a dataframe, you do 
something, you do the opposite operation, and, when you collect, it’s super 
fast because nothing really happened to the data. Hopefully, my request is 
clear enough - I’d like to use that in teaching when explaining the laziness of 
Spark.

Does anyone have that in their labs?

tia,

jg
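In the meantime, a minimal sketch (in Java) of the kind of lab described above, with a placeholder CSV file and an assumed "title" column: add a column, drop it again, and explain() shows that Catalyst removed the no-op work before anything runs:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CatalystLazinessLab {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Catalyst laziness lab")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.read()
        .option("header", "true")
        .csv("data/books.csv"); // hypothetical file

    // Do something, then do the opposite: the column never needs to be computed.
    Dataset<Row> noOp = df
        .withColumn("title_upper", upper(col("title"))) // assumed column "title"
        .drop("title_upper");

    // The optimized/physical plan shows the upper()/drop pair is gone.
    noOp.explain(true);

    // Only now does any work happen, and it is just the original scan.
    noOp.collect();
    spark.stop();
  }
}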


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: What's the best way to have Spark a service?

2018-03-15 Thread Jean Georges Perrin
Hi David,

I ended up building my own. Livy sounded great on paper, but was heavy to 
manipulate. I found out about Jobserver too late. We did not find it too 
complicated to build ours, with a small Spring Boot app that was holding the 
session (we did not need more than one session).

jg


> On Mar 15, 2018, at 07:06, David Espinosa  wrote:
> 
> Hi all,
> 
> I'm quite new to Spark, and I would like to ask what's the best way to have 
> Spark as a service; by that I mean being able to include the response of 
> a Scala app/job running in Spark in a common RESTful request.
> 
> Up to now I have read about Apache Livy (which I tried and found incompatibility 
> problems with my Scala app), Spark Jobserver, and Finch, and I read that I could 
> also use Spark Streams.
> 
> Thanks in advance,
> David


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Schema - DataTypes.NullType

2018-02-12 Thread Jean Georges Perrin
Thanks Nicholas. It makes sense. Now that I have a hint, I can play with it too!

jg

> On Feb 11, 2018, at 19:15, Nicholas Hakobian 
> <nicholas.hakob...@rallyhealth.com> wrote:
> 
> I spent a few minutes poking around in the source code and found this:
> 
> The data type representing None, used for the types that cannot be inferred.
> 
> https://github.com/apache/spark/blob/branch-2.1/python/pyspark/sql/types.py#L107-L113
>  
> <https://github.com/apache/spark/blob/branch-2.1/python/pyspark/sql/types.py#L107-L113>
> 
> Playing around a bit, this is the only use case that I could immediately come 
> up with; you have some type of placeholder field already in the data, but it's 
> always null. If you let createDataFrame (and I bet other things like 
> DataFrameReader would behave similarly) try to infer it directly, it will 
> error out since it can't infer the schema automatically. Doing something like 
> below will allow the data to be used. And, if memory serves, Hive also has a 
> concept of a Null data type for these types of situations.
> 
> In [9]: df = spark.createDataFrame([Row(id=1, val=None), Row(id=2, 
> val=None)], schema=StructType([StructField('id', LongType()), 
> StructField('val', NullType())]))
> 
> In [10]: df.show()
> +---+----+
> | id| val|
> +---+----+
> |  1|null|
> |  2|null|
> +---+----+
> 
> 
> In [11]: df.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- val: null (nullable = true)
> 
> 
> Nicholas Szandor Hakobian, Ph.D.
> Staff Data Scientist
> Rally Health
> nicholas.hakob...@rallyhealth.com <mailto:nicholas.hakob...@rallyhealth.com>
> 
> 
> On Sun, Feb 11, 2018 at 5:40 AM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> What is the purpose of DataTypes.NullType, especially as you are building a 
> schema? Has anyone used it or seen it as part of a schema auto-generation?
> 
> 
> (If I keep asking long enough, I may get an answer, no? :) )
> 
> 
> > On Feb 4, 2018, at 13:15, Jean Georges Perrin <j...@jgp.net 
> > <mailto:j...@jgp.net>> wrote:
> >
> > Any taker on this one? ;)
> >
> >> On Jan 29, 2018, at 16:05, Jean Georges Perrin <j...@jgp.net 
> >> <mailto:j...@jgp.net>> wrote:
> >>
> >> Hi Sparkians,
> >>
> >> Can someone tell me what is the purpose of DataTypes.NullType, especially 
> >> as you are building a schema?
> >>
> >> Thanks
> >>
> >> jg
> >> -
> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> >> <mailto:user-unsubscr...@spark.apache.org>
> >>
> >
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> > <mailto:user-unsubscr...@spark.apache.org>
> >
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



Re: Schema - DataTypes.NullType

2018-02-11 Thread Jean Georges Perrin
What is the purpose of DataTypes.NullType, especially as you are building a 
schema? Has anyone used it or seen it as part of a schema auto-generation?


(If I keep asking long enough, I may get an answer, no? :) )


> On Feb 4, 2018, at 13:15, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Any taker on this one? ;)
> 
>> On Jan 29, 2018, at 16:05, Jean Georges Perrin <j...@jgp.net> wrote:
>> 
>> Hi Sparkians,
>> 
>> Can someone tell me what is the purpose of DataTypes.NullType, especially as 
>> you are building a schema?
>> 
>> Thanks
>> 
>> jg
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Schema - DataTypes.NullType

2018-02-04 Thread Jean Georges Perrin
Any taker on this one? ;)

> On Jan 29, 2018, at 16:05, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Hi Sparkians,
> 
> Can someone tell me what is the purpose of DataTypes.NullType, especially as 
> you are building a schema?
> 
> Thanks
> 
> jg
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: is there a way to create new column with timeuuid using raw spark sql ?

2018-02-01 Thread Jean Georges Perrin
Sure, use withColumn()...

jg


> On Feb 1, 2018, at 05:50, kant kodali  wrote:
> 
> Hi All,
> 
> Is there any way to create a new timeuuid column on an existing dataframe 
> using raw SQL? You can assume that there is a timeuuid UDF function if that 
> helps.
> 
> Thanks!
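For reference, a minimal sketch (in Java) of both the raw SQL and the withColumn routes, with a placeholder UDF that returns a random UUID instead of a real time-based UUID (0-argument Java UDF registration needs a recent Spark 2.x):

import static org.apache.spark.sql.functions.*;

import java.util.UUID;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF0;
import org.apache.spark.sql.types.DataTypes;

public class TimeUuidColumn {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("timeuuid column")
        .master("local[*]")
        .getOrCreate();

    // Placeholder UDF: swap in your real timeuuid implementation.
    spark.udf().register("timeuuid",
        (UDF0<String>) () -> UUID.randomUUID().toString(), DataTypes.StringType);

    Dataset<Row> df = spark.range(5).toDF();
    df.createOrReplaceTempView("t");

    // Raw SQL version...
    spark.sql("select *, timeuuid() as tid from t").show(false);
    // ...and the withColumn version.
    df.withColumn("tid", callUDF("timeuuid")).show(false);

    spark.stop();
  }
}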


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Schema - DataTypes.NullType

2018-01-29 Thread Jean Georges Perrin
Hi Sparkians,

Can someone tell me what is the purpose of DataTypes.NullType, especially as you 
are building a schema?

Thanks

jg
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Type Casting Error in Spark Data Frame

2018-01-29 Thread Jean Georges Perrin
You can try to create new columns with the nested values.

> On Jan 29, 2018, at 15:26, Arnav kumar  wrote:
> 
> Hello Experts,
> 
> I would need your advice in resolving the below issue when I am trying to 
> retrieve data from a dataframe. 
> 
> Can you please let me know where I am going wrong.
> 
> code :
> 
> 
> // create the dataframe by parsing the json 
> // Message Helper describes the JSON Struct
> //data out is the json string received from Streaming Engine. 
> 
> val dataDF = sparkSession.createDataFrame(dataOut, MessageHelper.sqlMapping)
> dataDF.printSchema()
> /* -- out put of dataDF.printSchema
> 
> root
>  |-- messageID: string (nullable = true)
>  |-- messageType: string (nullable = true)
>  |-- meta: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- messageParsedTimestamp: string (nullable = true)
>  |||-- ipaddress: string (nullable = true)
>  |-- messageData: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- packetID: string (nullable = true)
>  |||-- messageID: string (nullable = true)
>  |||-- unixTime: string (nullable = true)
>  
> 
> 
> */
> 
> 
> dataDF.createOrReplaceTempView("message")
> val routeEventDF=sparkSession.sql("select messageId 
> ,messageData.unixTime,messageData.packetID, messageData.messageID from 
> message")
> routeEventDF.show
> 
> 
> Error  on routeEventDF.show
> Caused by: java.lang.RuntimeException: 
> org.apache.spark.sql.catalyst.expressions.GenericRow is not a valid external 
> type for schema of 
> array   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr14$(Unknown
>  Source)
> 
> 
> Appreciate your help
> 
> Best Regards
> Arnav Kumar.
> 
> 
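To make the "new columns with the nested values" suggestion concrete, a minimal sketch (in Java) that explodes the messageData array before selecting the struct fields; the column names are taken from the schema quoted above:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class FlattenMessageData {
  // dataDF has the schema shown above: messageID, messageType, meta,
  // and messageData as an array of structs.
  static Dataset<Row> routeEvents(Dataset<Row> dataDF) {
    return dataDF
        // One output row per element of the messageData array.
        .withColumn("event", explode(col("messageData")))
        // Now the struct fields can be read as plain columns.
        .select(
            col("messageID"),
            col("event.unixTime").alias("unixTime"),
            col("event.packetID").alias("packetID"),
            col("event.messageID").alias("eventMessageID"));
  }
}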


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Schema - DataTypes.NullType

2018-01-29 Thread Jean Georges Perrin
Hi Sparkians,

Can someone tell me what is the purpose of DataTypes.NullType, especially as you 
are building a schema?

Thanks

jg
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: S3 token times out during data frame "write.csv"

2018-01-25 Thread Jean Georges Perrin
Are you writing to S3 from an Amazon instance or from an on-premise install?
How many partitions are you writing from? Maybe you can try to “play” with 
repartitioning to see how it behaves?
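For example, a minimal sketch (in Java) of what "playing with repartitioning" can look like, with a made-up partition count and bucket name:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RepartitionedCsvWrite {
  // Hypothetical partition count and bucket: more, smaller files means no
  // single task holds the temporary S3 credentials long enough to hit expiry.
  static void write(Dataset<Row> df) {
    df.repartition(200)
      .write()
      .mode("overwrite")
      .csv("s3a://my-bucket/output/");
  }
}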

> On Jan 23, 2018, at 17:09, Vasyl Harasymiv  wrote:
> 
> It is about 400 million rows. S3 automatically chunks the file on their end 
> while writing, so that's fine, e.g. creates the same file name with 
> alphanumeric suffixes. 
> However, the write session expires due to token expiration. 
> 
> On Tue, Jan 23, 2018 at 5:03 PM, Jörn Franke  > wrote:
>  How large is the file?
> 
> If it is very large then you should have anyway several partitions for the 
> output. This is also important in case you need to read again from S3 - 
> having several files there enables parallel reading.
> 
> On 23. Jan 2018, at 23:58, Vasyl Harasymiv  > wrote:
> 
>> Hi Spark Community,
>> 
>> Saving a data frame into a file on S3 using:
>> 
>> df.write.csv(s3_location)
>> 
>> If run for longer than 30 mins, the following error persists:
>> 
>> The provided token has expired. (Service: Amazon S3; Status Code: 400; Error 
>> Code: ExpiredToken;`)
>> 
>> Potentially, because there is a hardcoded session limit in temporary S3 
>> connection from Spark.
>> 
>> One can specify the duration as per here:
>> 
>> https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html
>>  
>> 
>> 
>> One can, of course, chunk data into sub-30-minute writes. However, is there a 
>> way to change the token expiry parameter directly in Spark before using 
>> "write.csv"?
>> 
>> Thanks a lot for any help!
>> Vasyl
>> 
>> 
>> 
>> 
>> 
>> On Tue, Jan 23, 2018 at 2:46 PM, Toy > > wrote:
>> Thanks, I get this error when I switched to s3a://
>> 
>> Exception in thread "streaming-job-executor-0" java.lang.NoSuchMethodError: 
>> com.amazonaws.services.s3.transfer.TransferManager.(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
>>  at 
>> org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
>>  at 
>> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
>> 
>> On Tue, 23 Jan 2018 at 15:05 Patrick Alwell > > wrote:
>> Spark cannot read locally from S3 without an S3a protocol; you’ll more than 
>> likely need a local copy of the data or you’ll need to utilize the proper 
>> jars to enable S3 communication from the edge to the datacenter.
>> 
>>  
>> 
>> https://stackoverflow.com/questions/30385981/how-to-access-s3a-files-from-apache-spark
>>  
>> 
>>  
>> 
>> Here are the jars: 
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws 
>> 
>>  
>> 
>> Looks like you already have them, in which case you’ll have to make small 
>> configuration changes, e.g. s3 -> s3a
>> 
>>  
>> 
>> Keep in mind: The Amazon JARs have proven very brittle: the version of the 
>> Amazon libraries must match the versions against which the Hadoop binaries 
>> were built.
>> 
>>  
>> 
>> https://hortonworks.github.io/hdp-aws/s3-s3aclient/index.html#using-the-s3a-filesystem-client
>>  
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>>  
>> 
>> From: Toy >
>> Date: Tuesday, January 23, 2018 at 11:33 AM
>> To: "user@spark.apache.org " 
>> >
>> Subject: I can't save DataFrame from running Spark locally
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> First of all, my Spark application runs fine in AWS EMR. However, I'm trying 
>> to run it locally to debug some issue. My application is just to parse log 
>> files and convert to DataFrame then convert to ORC and save to S3. However, 
>> when I run locally I get this error
>> 
>>  
>> 
>> java.io.IOException: /orc/dt=2018-01-23 doesn't exist
>> 
>> at 
>> org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:170)
>> 
>> at 
>> org.apache.hadoop.fs.s3.Jets3tFileSystemStore.retrieveINode(Jets3tFileSystemStore.java:221)
>> 
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> 
>> at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> 
>> at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> 
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> 
>> at 
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>> 
>> 

Re: Custom Data Source for getting data from Rest based services

2017-12-24 Thread Jean Georges Perrin
If you need Java code, you can have a look @: 
https://github.com/jgperrin/net.jgp.labs.spark.datasources 


and:
https://databricks.com/session/extending-apache-sparks-ingestion-building-your-own-java-data-source
 


> On Dec 24, 2017, at 2:56 AM, Subarna Bhattacharyya 
>  wrote:
> 
> Hi Sourav,
> Looks like this would be a good utility for the development of large scale
> data driven product based on Data services. 
> 
> We are an early stage startup called Climformatics and  we are building a
> customized high resolution climate prediction tool. This effort requires
> synthesis of large scale data input from multiple data sources. This tool
> can help in getting large volume of data from multiple data services through
> api calls which are somewhat limited to their bulk use.
> 
> One feature that would help us further is if you could have a handle on
> setting the limits on how many data points can be grabbed at once, since the
> data sources that we access are often limited by the number of service calls
> that one can do at a time (say per minute).
> 
> Also, we need a way to pass the parameter inputs (for multiple calls) through
> the URL path itself. Many of the data sources we use need the parameters
> to be included in the URI path itself instead of passing them as key/value
> parameters. An example is https://www.wunderground.com/weather/api/d/docs.
> 
> We would try to give a closer look to the github link you provided and get
> back to you with feedback.
> 
> Thanks,
> Sincerely,
> Subarna
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Storage at node or executor level

2017-12-22 Thread Jean Georges Perrin
Hi all,

This is more of a general architecture question; I have my idea, but wanted to 
confirm or refute it...

When your executor is accessing data, where is it stored: at the executor level 
or at the worker level? 

jg
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: learning Spark

2017-12-05 Thread Jean Georges Perrin
When you pick a book, make sure it covers the version of Spark you want to 
deploy. There are a lot of books out there that focus a lot on Spark 1.x. Spark 
2.x generalizes the dataframe API, introduces Tungsten, etc. All might not be 
relevant to a pure “sys admin” learning, but it is good to know.

jg

> On Dec 3, 2017, at 22:48, Manuel Sopena Ballesteros  
> wrote:
> 
> Dear Spark community,
>  
> Is there any resource (books, online courses, etc.) available that you know of 
> to learn about Spark? I am interested in the sys admin side of it: the 
> different parts inside Spark, how Spark works internally, the best ways to 
> install/deploy/monitor, and how to get the best performance possible.
>  
> Any suggestion?
>  
> Thank you very much
>  
> Manuel Sopena Ballesteros | Systems Engineer
> Garvan Institute of Medical Research 
> The Kinghorn Cancer Centre, 370 Victoria Street, Darlinghurst, NSW 2010
> T: + 61 (0)2 9355 5760 | F: +61 (0)2 9295 8507 | E: manuel...@garvan.org.au 
> 
>  
> NOTICE
> Please consider the environment before printing this email. This message and 
> any attachments are intended for the addressee named and may contain legally 
> privileged/confidential/copyright information. If you are not the intended 
> recipient, you should not read, use, disclose, copy or distribute this 
> communication. If you have received this message in error please notify us at 
> once by return email and then delete both messages. We accept no liability 
> for the distribution of viruses or similar in electronic communications. This 
> notice should not be removed.



Re: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-11-03 Thread Jean Georges Perrin
Write a UDF?

> On Oct 31, 2017, at 11:48, Aakash Basu  > wrote:
> 
> Hey all,
> 
> Any help in the below please?
> 
> Thanks,
> Aakash.
> 
> 
> -- Forwarded message --
> From: Aakash Basu  >
> Date: Tue, Oct 31, 2017 at 9:17 PM
> Subject: Regarding column partitioning IDs and names as per hierarchical 
> level SparkSQL
> To: user >
> 
> 
> Hi all,
> 
> I have to generate a table with Spark-SQL with the following columns -
> 
> 
> Level One Id: VARCHAR(20) NULL
> Level One Name: VARCHAR(50) NOT NULL
> Level Two Id: VARCHAR(20) NULL
> Level Two Name: VARCHAR(50) NULL
> Level Three Id: VARCHAR(20) NULL
> Level Three Name: VARCHAR(50) NULL
> Level Four Id: VARCHAR(20) NULL
> Level Four Name: VARCHAR(50) NULL
> Level Five Id: VARCHAR(20) NULL
> Level Five Name: VARCHAR(50) NULL
> Level Six Id: VARCHAR(20) NULL
> Level Six Name: VARCHAR(50) NULL
> Level Seven Id: VARCHAR(20) NULL
> Level Seven Name: VARCHAR(50) NULL
> Level Eight Id: VARCHAR(20) NULL
> Level Eight Name: VARCHAR(50) NULL
> Level Nine Id: VARCHAR(20) NULL
> Level Nine Name: VARCHAR(50) NULL
> Level Ten Id: VARCHAR(20) NULL
> Level Ten Name: VARCHAR(50) NULL
> 
> My input source has these columns -
> 
> 
> ID        Description        ParentID
> 10        Great-Grandfather
> 1010      Grandfather        10
> 101010    1. Father A        1010
> 101011    2. Father B        1010
> 101012    4. Father C        1010
> 101013    5. Father D        1010
> 101015    3. Father E        1010
> 101018    Father F           1010
> 101019    6. Father G        1010
> 101020    Father H           1010
> 101021    Father I           1010
> 101022    2A. Father J       1010
> 10101010  2. Father K        101010
> 
> Like the above, I have IDs of up to 20 digits, which means I have 10 levels.
> 
> I want to populate the ID and name itself, along with all the parents up to the 
> root, for any particular level, which I am unable to create a concrete logic 
> for.
> 
> I am using this approach to fetch the respective levels and populate them in the 
> respective columns, but not their parents -
> 
> Present Logic ->
> 
> FinalJoin_DF = spark.sql("select "
>   + "case when length(a.id)/2 = '1' then a.id else ' ' end as level_one_id, "
>   + "case when length(a.id)/2 = '1' then a.desc else ' ' end as level_one_name, "
>   + "case when length(a.id)/2 = '2' then a.id else ' ' end as level_two_id, "
>   + "case when length(a.id)/2 = '2' then a.desc else ' ' end as level_two_name, "
>   + "case when length(a.id)/2 = '3' then a.id else ' ' end as level_three_id, "
>   + "case when length(a.id)/2 = '3' then a.desc else ' ' end as level_three_name, "
>   + "case when length(a.id)/2 = '4' then a.id else ' ' end as level_four_id, "
>   + "case when length(a.id)/2 = '4' then a.desc else ' ' end as level_four_name, "
>   + "case when length(a.id)/2 = '5' then a.id else ' ' end as level_five_id, "
>   + "case when length(a.id)/2 = '5' then a.desc else ' ' end as level_five_name, "
>   + "case when length(a.id)/2 = '6' then a.id else ' ' end as level_six_id, "
>   + "case when length(a.id)/2 = '6' then a.desc else ' ' end as level_six_name, "
>   + "case when length(a.id)/2 = '7' then a.id else ' ' end as level_seven_id, "
>   + "case when length(a.id)/2 = '7' then a.desc else ' ' end as level_seven_name, "
>   + "case when length(a.id)/2 = '8' then a.id else ' ' end as level_eight_id, "
>   + "case when length(a.id)/2 = '8' then a.desc else ' ' end as level_eight_name, "
>   + "case when length(a.id)/2 = '9' then a.id else ' ' end as level_nine_id, "
>   + "case when length(a.id)/2 = '9' then a.desc else ' ' end as level_nine_name, "
>   + "case when length(a.id)/2 = '10' then a.id else ' ' end as level_ten_id, "
>   + "case when length(a.id)/2 = '10' then a.desc else ' ' end as level_ten_name "
>   + "from 

Re: How to get the data url

2017-11-03 Thread Jean Georges Perrin
I am a little confused by your question… Are you trying to ingest a file from 
S3?

If so… look for net.jgp.labs.spark on GitHub and look for 
net.jgp.labs.spark.l000_ingestion.l001_csv_in_progress.S3CsvToDataset 

You can modify the file as the keys are yours…

If you want to download first: look at 
net.jgp.labs.spark.l900_analytics.ListNCSchoolDistricts

jgp

> On Oct 29, 2017, at 22:20, onoke  > wrote:
> 
> Hi,
> 
> I am searching for a useful API for getting a data URL that is accessed by an
> application on Spark.
> For example, when this URL is in an application
> 
>   new
> URL("https://s3.amazonaws.com/apache-zeppelin/tutorial/bank/bank.csv")
> 
> How can I get this URL using the Spark API?
> I looked in org.apache.spark.api.java and org.apache.spark.status.api.v1, but they
> do not provide any URL info.
> 
> Any advice are welcome.
> 
> -Keiji 
> 
> 
> 
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ 
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> 
> 



Re: Hi all,

2017-11-03 Thread Jean Georges Perrin
Hi Oren,

Why don’t you want to use a GroupBy? You can cache or checkpoint the result and 
use it in your process, keeping everything in Spark and avoiding 
save/ingestion...
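A minimal sketch (in Java) of that idea, with placeholder column names and aggregation: do the groupBy once, cache the result, and reuse it in the downstream process:

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class ReusePerUserAggregate {
  // "userId" and the collect_list aggregation are placeholders.
  static Dataset<Row> perUser(Dataset<Row> preprocessed) {
    return preprocessed
        .groupBy("userId")
        .agg(collect_list(col("event")).alias("events"))
        .cache(); // keep it inside Spark so the process job can reuse it
  }
}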


> On Oct 31, 2017, at 08:17, ⁨אורן שמון⁩ <⁨oren.sha...@gmail.com⁩> wrote:
> 
> I have 2 Spark jobs: one is the pre-process and the second is the process.
> The process job needs to calculate something for each user in the data.
> I want to avoid a shuffle like groupBy, so I am thinking about saving the result of 
> the pre-process bucketed by user in Parquet, or repartitioning by user and 
> saving the result.
> 
> Which is preferred? And why? 
> Thanks in advance,
> Oren


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Anyone knows how to build and spark on jdk9?

2017-10-27 Thread Jean Georges Perrin
May I ask what the use case is? Although it is a very interesting question, 
I would be concerned about going further than a proof of concept. A lot of the 
enterprises I see and visit are barely on Java 8, so starting to talk JDK 9 
might be slight overkill, but if you have a good story, I’m all for it!

jg


> On Oct 27, 2017, at 03:44, Zhang, Liyun  wrote:
> 
> Thanks for your suggestion; it seems that Scala 2.12.4 supports JDK 9
>  
> Scala 2.12.4 is now available.
> 
> Our benchmarks show a further reduction in compile times since 2.12.3 of 
> 5-10%.
> 
> Improved Java 9 friendliness, with more to come!
> 
>  
> Best Regards
> Kelly Zhang/Zhang,Liyun
>  
>  
>  
>  
>  
> From: Reynold Xin [mailto:r...@databricks.com] 
> Sent: Friday, October 27, 2017 10:26 AM
> To: Zhang, Liyun ; d...@spark.apache.org; 
> user@spark.apache.org
> Subject: Re: Anyone knows how to build and spark on jdk9?
>  
> It probably depends on the Scala version we use in Spark supporting Java 9 
> first. 
>  
> On Thu, Oct 26, 2017 at 7:22 PM Zhang, Liyun  wrote:
> Hi all:
> 1.   I want to build Spark on JDK 9 and test it with Hadoop in a JDK 9 env. I 
> searched for JIRAs related to JDK 9 and only found SPARK-13278. Does this mean 
> Spark can now build and run successfully on JDK 9?
>  
>  
> Best Regards
> Kelly Zhang/Zhang,Liyun
>  


Re: What is the equivalent of forearchRDD in DataFrames?

2017-10-26 Thread Jean Georges Perrin
Just hints: Repartition into 10? Get the RDD from the dataframe?

What about a foreach over the rows, sending every 100? (I just did that, actually.)

jg


> On Oct 26, 2017, at 13:37, Noorul Islam Kamal Malmiyoda  
> wrote:
> 
> Hi all,
> 
> I have a Dataframe with 1000 records. I want to split them into 100
> each and post to rest API.
> 
> If it was RDD, I could use something like this
> 
>myRDD.foreachRDD {
>  rdd =>
>rdd.foreachPartition {
>  partition => {
> 
> This will ensure that code is executed on executors and not on driver.
> 
> Is there any similar approach that we can take for Dataframes? I see
> examples on stackoverflow with collect() which will bring whole data
> to driver.
> 
> Thanks and Regards
> Noorul
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
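A minimal sketch (in Java) of the foreachPartition route on a Dataset, batching 100 rows per call; postBatch() is a placeholder for the actual REST client code:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.api.java.function.ForeachPartitionFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PostInBatches {
  static void post(Dataset<Row> df) {
    // Runs on the executors, not on the driver, and never collects the whole dataset.
    df.foreachPartition((ForeachPartitionFunction<Row>) (Iterator<Row> rows) -> {
      List<Row> batch = new ArrayList<>(100);
      while (rows.hasNext()) {
        batch.add(rows.next());
        if (batch.size() == 100) {
          postBatch(batch);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        postBatch(batch);
      }
    });
  }

  // Placeholder: replace with your REST client of choice.
  static void postBatch(List<Row> batch) {
    System.out.println("posting " + batch.size() + " rows");
  }
}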


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Quick one... AWS SDK version?

2017-10-07 Thread Jean Georges Perrin

Hey Marco,

I am actually reading from S3 and I use 2.7.3, but I inherited the project and 
they use some AWS APIs from the Amazon SDK, whose version is like from yesterday :) 
so it’s confusing, and AWS is changing its version like crazy, so it’s a little 
difficult to follow. Right now I went back to 2.7.3 and SDK 1.7.4...

jg


> On Oct 7, 2017, at 15:34, Marco Mistroni  wrote:
> 
> Hi JG
>  out of curiosity, what's your use case? Are you writing to S3? You could use 
> Spark to do that, e.g. using the Hadoop package 
> org.apache.hadoop:hadoop-aws:2.7.1; that will download the AWS client which 
> is in line with Hadoop 2.7.1.
> 
> hth
>  marco
> 
>> On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly  
>> wrote:
>> Note: EMR builds Hadoop, Spark, et al, from source against specific versions 
>> of certain packages like the AWS Java SDK, httpclient/core, Jackson, etc., 
>> sometimes requiring some patches in these applications in order to work with 
>> versions of these dependencies that differ from what the applications may 
>> support upstream.
>> 
>> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis connector, 
>> that is, since that's the only part of Spark that actually depends upon the 
>> AWS Java SDK directly) against AWS Java SDK 1.11.160 instead of the much 
>> older version that vanilla Hadoop 2.7.3 would otherwise depend upon.
>> 
>> ~ Jonathan
>> 
>>> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran  
>>> wrote:
 On 3 Oct 2017, at 21:37, JG Perrin  wrote:
 
 Sorry Steve – I may not have been very clear: thinking about 
 aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled 
 with Spark.
>>> 
>>> 
>>> I know, but if you are talking to s3 via the s3a client, you will need the 
>>> SDK version to match the hadoop-aws JAR of the same version of Hadoop your 
>>> JARs have. Similarly, if you were using spark-kinesis, it needs to be in 
>>> sync there. 
  
 From: Steve Loughran [mailto:ste...@hortonworks.com] 
 Sent: Tuesday, October 03, 2017 2:20 PM
 To: JG Perrin 
 Cc: user@spark.apache.org
 Subject: Re: Quick one... AWS SDK version?
  
  
 On 3 Oct 2017, at 02:28, JG Perrin  wrote:
  
 Hey Sparkians,
  
 What version of AWS Java SDK do you use with Spark 2.2? Do you stick with 
 the Hadoop 2.7.3 libs?
  
 You generally have to stick with the version which Hadoop was built 
 with, I'm afraid... a very brittle dependency. 
> 


Re: Spark code to get select firelds from ES

2017-09-20 Thread Jean Georges Perrin
Same issue with RDBMS ingestion (I think). I solved it with views. Can you do 
views on ES?

jg


> On Sep 20, 2017, at 09:22, Kedarnath Dixit  
> wrote:
> 
> Hi,
> 
> I want to get only select fields from ES using Spark ES connector.
> 
> I have done some code which is fetching all the documents matching given 
> index as below:
> 
> JavaPairRDD> esRDD = JavaEsSpark.esRDD(jsc, 
> searchIndex);
> 
> However, is there a way to only get specific fields from documents for every 
> index in ES, rather than getting everything?
> 
> Example: Let's say I have many fields in the documents as below, and I have 
> @timestamp, which is also a field in the response { .., 
> @timestamp=Fri Jul 07 01:36:00 IST 2017, ..}. Here, how can I get 
> only the field @timestamp for all my indexes?
> 
> I could see something here but was unable to correlate. Can someone help me, 
> please?
> 
> 
> Many Thanks!
> ~KD
> 
> 
> 
> 
> 
> 
> 
> 
>  With Regards,
>   
>  ~Kedar Dixit
> 
> 
> 
> 
>  kedarnath_di...@persistent.com | @kedarsdixit | M +91 90499 15588 | T +91 
> (20) 6703 4783
> Persistent Systems | Partners In Innovation | www.persistent.com
> 
> 
> DISCLAIMER
> ==
> This e-mail may contain privileged and confidential information which is the 
> property of Persistent Systems Ltd. It is intended only for the use of the 
> individual or entity to which it is addressed. If you are not the intended 
> recipient, you are not authorized to read, retain, copy, print, distribute or 
> use this message. If you have received this communication in error, please 
> notify the sender and delete all copies of this message. Persistent Systems 
> Ltd. does not accept any liability for virus infected mails.
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Nested RDD operation

2017-09-19 Thread Jean Georges Perrin
Have you tried to cache? maybe after the collect() and before the map?

> On Sep 19, 2017, at 7:20 AM, Daniel O' Shaughnessy 
> <danieljamesda...@gmail.com> wrote:
> 
> Thanks for your response Jean. 
> 
> I managed to figure this out in the end but it's an extremely slow solution 
> and not tenable for my use-case:
> 
>   val rddX = 
> dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim
>  replaceAll ("[\\[\\]\"]", "")).toList)
>   //val oneRow = 
> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>   rddX.take(5).foreach(println)
>   val severalRows = rddX.collect().map(row =>
> if (row.length == 1) {
>   
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0).toString)).toDF("event_name")).select("eventIndex").first().getDouble(0))
> } else {
>   row.map(tool => {
> 
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(tool.toString)).toDF("event_name")).select("eventIndex").first().getDouble(0))
>   })
>   })
> 
> Wondering if there is any better/faster way to do this ?
> 
> Thanks.
> 
> 
> 
> On Fri, 15 Sep 2017 at 13:31 Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hey Daniel, not sure this will help, but... I had a similar need where I 
> wanted the content of a dataframe to become a "cell" or a row in the parent 
> dataframe. I grouped the child dataframe, then collected it as a list in the 
> parent dataframe after a join operation. As I said, not sure it matches your 
> use case, but HIH...
> jg
> 
>> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy 
>> <danieljamesda...@gmail.com <mailto:danieljamesda...@gmail.com>> wrote:
>> 
>> Hi guys,
>> 
>> I'm having trouble implementing this scenario:
>> 
>> I have a column with a typical entry being : ['apple', 'orange', 'apple', 
>> 'pear', 'pear']
>> 
>> I need to use a StringIndexer to transform this to : [0, 2, 0, 1, 1]
>> 
>> I'm attempting to do this but because of the nested operation on another RDD 
>> I get the NPE.
>> 
>> Here's my code so far, thanks:
>> 
>> val dfWithSchema = 
>> sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", "event_name")
>> 
>>   // attempting
>>   import sqlContext.implicits._
>>   val event_list = dfWithSchema.select("event_name").distinct
>>   val event_listDF = event_list.toDF()
>>   val eventIndexer = new StringIndexer()
>> .setInputCol("event_name")
>> .setOutputCol("eventIndex")
>> .fit(event_listDF)
>> 
>>   val eventIndexed = eventIndexer.transform(event_listDF)
>> 
>>   val converter = new IndexToString()
>> .setInputCol("eventIndex")
>> .setOutputCol("originalCategory")
>> 
>>   val convertedEvents = converter.transform(eventIndexed)
>>   val rddX = 
>> dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim
>>  replaceAll ("[\\[\\]\"]", "")).toList)
>>   //val oneRow = 
>> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
>> 
>>   val severalRows = rddX.map(row => {
>> // Split array into n tools
>> println("ROW: " + row(0).toString)
>> println(row(0).getClass)
>> println("PRINT: " + 
>> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0))).toDF("event_name")).select("eventIndex").first().getDouble(0))
>> 
>> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row)).toDF("event_name")).select("eventIndex").first().getDouble(0),
>>  Seq(row).toString)
>>   })
>>   // attempting
> 



[Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

2017-09-18 Thread Jean Georges Perrin
Hi,

I am trying to connect to a new cluster I just set up.

And I get...
[Timer-0:WARN] Logging$class: Initial job has not accepted any resources; check 
your cluster UI to ensure that workers are registered and have sufficient 
resources

I must have forgotten something really super obvious.

My connection code is:
SparkSession spark = SparkSession.builder()
.appName("JavaSparkPi")
.master("spark://10.0.100.81:7077")
.config("spark.executor.memory", "4g")
.config("spark.executor.cores", "2")
.getOrCreate();

My master is on 10.0.100.81, started with a simple start-master.sh. I have 2 
nodes, which I run with ./sbin/start-slave.sh spark://un.oplo.io:7077.

I use Spark's own resource scheduler.

jg






Re: Nested RDD operation

2017-09-15 Thread Jean Georges Perrin
Hey Daniel, not sure this will help, but... I had a similar need where I wanted 
the content of a dataframe to become a "cell" or a row in the parent dataframe. 
I grouped the child dataframe, then collected it as a list in the parent 
dataframe after a join operation. As I said, not sure it matches your use case, 
but HIH...
jg

> On Sep 15, 2017, at 5:42 AM, Daniel O' Shaughnessy 
>  wrote:
> 
> Hi guys,
> 
> I'm having trouble implementing this scenario:
> 
> I have a column with a typical entry being : ['apple', 'orange', 'apple', 
> 'pear', 'pear']
> 
> I need to use a StringIndexer to transform this to : [0, 2, 0, 1, 1]
> 
> I'm attempting to do this but because of the nested operation on another RDD 
> I get the NPE.
> 
> Here's my code so far, thanks:
> 
> val dfWithSchema = sqlContext.createDataFrame(eventFeaturesRDD).toDF("email", 
> "event_name")
> 
>   // attempting
>   import sqlContext.implicits._
>   val event_list = dfWithSchema.select("event_name").distinct
>   val event_listDF = event_list.toDF()
>   val eventIndexer = new StringIndexer()
> .setInputCol("event_name")
> .setOutputCol("eventIndex")
> .fit(event_listDF)
> 
>   val eventIndexed = eventIndexer.transform(event_listDF)
> 
>   val converter = new IndexToString()
> .setInputCol("eventIndex")
> .setOutputCol("originalCategory")
> 
>   val convertedEvents = converter.transform(eventIndexed)
>   val rddX = 
> dfWithSchema.select("event_name").rdd.map(_.getString(0).split(",").map(_.trim
>  replaceAll ("[\\[\\]\"]", "")).toList)
>   //val oneRow = 
> Converted(eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq("CCB")).toDF("event_name")).select("eventIndex").first().getDouble(0))
> 
>   val severalRows = rddX.map(row => {
> // Split array into n tools
> println("ROW: " + row(0).toString)
> println(row(0).getClass)
> println("PRINT: " + 
> eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row(0))).toDF("event_name")).select("eventIndex").first().getDouble(0))
> 
> (eventIndexer.transform(sqlContext.sparkContext.parallelize(Seq(row)).toDF("event_name")).select("eventIndex").first().getDouble(0),
>  Seq(row).toString)
>   })
>   // attempting



Re: How to convert Row to JSON in Java?

2017-09-10 Thread Jean Georges Perrin
Sorry - more likely l700 save. 

jg


> On Sep 10, 2017, at 20:56, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Hey,
> 
> I have a few examples https://github.com/jgperrin/net.jgp.labs.spark. I 
> recently worked on such problems, so there's definitely a solution there or 
> I'll be happy to write one for you. 
> 
> Look in l250 map... 
> 
> jg
> 
> 
>> On Sep 10, 2017, at 20:51, ayan guha <guha.a...@gmail.com> wrote:
>> 
>> Sorry for side-line question, but for Python, isn't following the easiest:
>> 
>> >>> import json
>> >>> df1 = df.rdd.map(lambda r: json.dumps(r.asDict()))
>> >>> df1.take(10)
>> ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}', '{"id": 5}']
>> 
>> 
>> 
>> 
>>> On Mon, Sep 11, 2017 at 4:22 AM, Riccardo Ferrari <ferra...@gmail.com> 
>>> wrote:
>>> Hi Kant,
>>> 
>>> You can check the getValuesMap. I found this post useful, it is in Scala 
>>> but should be a good starting point.
>>> An alternative approach is combine the 'struct' and 'to_json' functions. I 
>>> have not tested this in Java but I am using it in Python.
>>> 
>>> Best,
>>> 
>>>> On Sun, Sep 10, 2017 at 1:45 AM, kant kodali <kanth...@gmail.com> wrote:
>>>> toJSON on Row object.
>>>> 
>>>>> On Sat, Sep 9, 2017 at 4:18 PM, Felix Cheung <felixcheun...@hotmail.com> 
>>>>> wrote:
>>>>> toJSON on Dataset/DataFrame?
>>>>> 
>>>>> From: kant kodali <kanth...@gmail.com>
>>>>> Sent: Saturday, September 9, 2017 4:15:49 PM
>>>>> To: user @spark
>>>>> Subject: How to convert Row to JSON in Java?
>>>>>  
>>>>> Hi All,
>>>>> 
>>>>> How to convert Row to JSON in Java? It would be nice to have .toJson() 
>>>>> method in the Row class.
>>>>> 
>>>>> Thanks,
>>>>> kant
>>>> 
>>> 
>> 
>> 
>> 
>> -- 
>> Best Regards,
>> Ayan Guha


Re: How to convert Row to JSON in Java?

2017-09-10 Thread Jean Georges Perrin
Hey,

I have a few examples https://github.com/jgperrin/net.jgp.labs.spark. I 
recently worked on such problems, so there's definitely a solution there or 
I'll be happy to write one for you. 

Look in l250 map... 
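For reference, a minimal sketch (in Java) of the two easy routes: the whole-dataframe toJSON(), and the struct/to_json combination Riccardo mentions below; the extra column name is just an example.

import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class RowsToJson {
  static Dataset<String> wholeDataframe(Dataset<Row> df) {
    // Every row becomes one JSON string.
    return df.toJSON();
  }

  static Dataset<Row> asExtraColumn(Dataset<Row> df) {
    // Keep the original columns and add a "json" column built from all of them.
    return df.withColumn("json", to_json(struct(col("*"))));
  }
}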

jg


> On Sep 10, 2017, at 20:51, ayan guha  wrote:
> 
> Sorry for side-line question, but for Python, isn't following the easiest:
> 
> >>> import json
> >>> df1 = df.rdd.map(lambda r: json.dumps(r.asDict()))
> >>> df1.take(10)
> ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}', '{"id": 5}']
> 
> 
> 
> 
>> On Mon, Sep 11, 2017 at 4:22 AM, Riccardo Ferrari  wrote:
>> Hi Kant,
>> 
>> You can check the getValuesMap. I found this post useful, it is in Scala but 
>> should be a good starting point.
>> An alternative approach is combine the 'struct' and 'to_json' functions. I 
>> have not tested this in Java but I am using it in Python.
>> 
>> Best,
>> 
>>> On Sun, Sep 10, 2017 at 1:45 AM, kant kodali  wrote:
>>> toJSON on Row object.
>>> 
 On Sat, Sep 9, 2017 at 4:18 PM, Felix Cheung  
 wrote:
 toJSON on Dataset/DataFrame?
 
 From: kant kodali 
 Sent: Saturday, September 9, 2017 4:15:49 PM
 To: user @spark
 Subject: How to convert Row to JSON in Java?
  
 Hi All,
 
 How to convert Row to JSON in Java? It would be nice to have .toJson() 
 method in the Row class.
 
 Thanks,
 kant
>>> 
>> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha


Re: Spark 2.0.0 and Hive metastore

2017-08-29 Thread Jean Georges Perrin
Sorry if my comment is not helping, but... why do you need Hive? Can't you save 
your aggregation using parquet for example?

jg


> On Aug 29, 2017, at 08:34, Andrés Ivaldi  wrote:
> 
> Hello, I'm using the Spark API with Hive support. I don't have a Hive 
> instance, I'm just using Hive for some aggregation functions.
> 
> The problem is that Hive creates the hive and metastore_db folders in the temp 
> folder; I want to change that location.
> 
> Regards.
> 
> -- 
> Ing. Ivaldi Andres


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark 2 | Java | Dataset

2017-08-17 Thread Jean Georges Perrin
Hey,

I was wondering if it would make sense to have a Dataset of something other than 
Row?

Does anyone have an example (in Java) or a use case?

My use case would be to use Spark on existing objects we have and benefit from 
the distributed processing on those objects.
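A minimal sketch (in Java) of what such a Dataset can look like, using Encoders.bean() on a plain bean; the Book class is only an example:

import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class TypedDatasetExample {
  public static class Book implements Serializable {
    private String title;
    private int year;
    public Book() {}
    public Book(String title, int year) { this.title = title; this.year = year; }
    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public int getYear() { return year; }
    public void setYear(int year) { this.year = year; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Typed dataset")
        .master("local[*]")
        .getOrCreate();

    // Dataset<Book> instead of Dataset<Row>: the objects keep their own type.
    Dataset<Book> books = spark.createDataset(
        Arrays.asList(new Book("A", 2015), new Book("B", 2017)),
        Encoders.bean(Book.class));

    // Typed operations work on Book; relational ones still work too.
    books.filter((FilterFunction<Book>) b -> b.getYear() > 2016).show();

    spark.stop();
  }
}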

jg



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Quick one on evaluation

2017-08-04 Thread Jean Georges Perrin
Thanks Daniel,

I like your answer for #1. It makes sense.

However, I don't get why you say that there are always pending 
transformations... After you call an action, you should be "clean" from pending 
transformations, no?

> On Aug 3, 2017, at 5:53 AM, Daniel Darabos <daniel.dara...@lynxanalytics.com> 
> wrote:
> 
> 
> On Wed, Aug 2, 2017 at 2:16 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi Sparkians,
> 
> I understand the lazy evaluation mechanism with transformations and actions. 
> My question is simpler: 1) are show() and/or printSchema() actions? I would 
> assume so...
> 
> show() is an action (it prints data) but printSchema() is not an action. 
> Spark can tell you the schema of the result without computing the result.
> 
> and optional question: 2) is there a way to know if there are transformations 
> "pending"?
>  
> There are always transformations pending :). An RDD or DataFrame is a series 
> of pending transformations. If you say val df = spark.read.csv("foo.csv"), 
> that is a pending transformation. Even spark.emptyDataFrame is best 
> understood as a pending transformation: it does not do anything on the 
> cluster, but records locally what it will have to do on the cluster.
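One concrete way to peek at that pending work, as a minimal sketch in Java: explain() prints the plans Catalyst has queued up without executing anything.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class InspectPendingWork {
  static void inspect(Dataset<Row> df) {
    // Nothing executes here: explain() just prints the plans Catalyst has queued up.
    df.explain(true);
    // The RDD lineage gives a similar "what is pending" view.
    System.out.println(df.rdd().toDebugString());
  }
}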



Re: SPARK Issue in Standalone cluster

2017-08-04 Thread Jean Georges Perrin
I use CIFS and it works reasonably well, is easily cross-platform, and is well 
documented...

> On Aug 4, 2017, at 6:50 AM, Steve Loughran  wrote:
> 
> 
>> On 3 Aug 2017, at 19:59, Marco Mistroni  wrote:
>> 
>> Hello
>> my 2 cents here, hope it helps
>> If you just want to play around with Spark, I'd leave Hadoop out; it's an 
>> unnecessary dependency that you don't need for just running a Python script.
>> Instead, do the following:
>> - go to the root of your master / slave node. create a directory 
>> /root/pyscripts 
>> - place your csv file there as well as the python script
>> - run the script to replicate the whole directory across the cluster (I 
>> believe it's called copy-script.sh)
>> - then run your spark-submit , it will be something lke
>>./spark-submit /root/pyscripts/mysparkscripts.py  
>> file:///root/pyscripts/tree_addhealth.csv 10 --master 
>> spark://ip-172-31-44-155.us-west-2.compute.internal:7077
>> - in your python script, as part of your processing, write the parquet file 
>> in directory /root/pyscripts 
>> 
> 
> That's going to hit the commit problem discussed: only the spark driver 
> executes the final commit process; the output from the other servers doesn't 
> get picked up and promoted. You need a shared store (NFS is the easy one)
> 
> 
>> If you have an AWS account and you are versatile with that - you need to 
>> setup bucket permissions etc - , you can just
>> - store your file in one of your S3 bucket
>> - create an EMR cluster
>> - connect to master or slave
>> - run your  scritp that reads from the s3 bucket and write to the same s3 
>> bucket
> 
> 
> Aah, and now we are into the problem of implementing a safe commit protocol 
> for an inconsistent filesystem
> 
> My current stance there is out-the-box S3 isn't safe to use as the direct 
> output of work, Azure is. It mostly works for a small experiment, but I 
> wouldn't use it in production.
> 
> Simplest: work on one machine, if you go to 2-3 for exploratory work: NFS
> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
Hey Jörn,

The "pending" was more something like a flag like myDf.hasCatalystWorkToDo() or 
myDf.isPendingActions(). Maybe an access to the DAG?

I just did that:
ordersDf = ordersDf.withColumn(
"time_to_ship", 
datediff(ordersDf.col("ship_date"), ordersDf.col("order_date")));

ordersDf.printSchema();
ordersDf.show();

and the schema and data are correct, so I was wondering what triggered 
Catalyst...

jg



> On Aug 2, 2017, at 8:29 AM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> I assume printSchema would not trigger an evaluation. Show might partially 
> trigger an evaluation (not all data is shown, only a certain number of rows 
> by default).
> Keep in mind that even a count might not trigger evaluation of all rows 
> (especially in the future) due to updates to the optimizer. 
> 
> What do you mean by pending ? You can see the status of the job in the UI. 
> 
>> On 2. Aug 2017, at 14:16, Jean Georges Perrin <j...@jgp.net> wrote:
>> 
>> Hi Sparkians,
>> 
>> I understand the lazy evaluation mechanism with transformations and actions. 
>> My question is simpler: 1) are show() and/or printSchema() actions? I would 
>> assume so...
>> 
>> and optional question: 2) is there a way to know if there are 
>> transformations "pending"?
>> 
>> Thanks!
>> 
>> jg
>> 
>> 
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> 
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 



Quick one on evaluation

2017-08-02 Thread Jean Georges Perrin
Hi Sparkians,

I understand the lazy evaluation mechanism with transformations and actions. My 
question is simpler: 1) are show() and/or printSchema() actions? I would assume 
so...

and optional question: 2) is there a way to know if there are transformations 
"pending"?

Thanks!

jg

 
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is there a way to run Spark SQL through REST?

2017-07-22 Thread Jean Georges Perrin
There's Livy, but it's pretty resource intensive.

I know it's not helpful, but my company has developed its own and I am trying to 
open source it. 

Looks like there are quite a few companies who had the need and custom built. 

jg


> On Jul 22, 2017, at 04:01, kant kodali  wrote:
> 
> Is there a way to run Spark SQL through REST?


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [ANNOUNCE] Announcing Apache Spark 2.2.0

2017-07-11 Thread Jean Georges Perrin
Awesome! Congrats! Can't wait!!

jg


> On Jul 11, 2017, at 18:48, Michael Armbrust  wrote:
> 
> Hi all,
> 
> Apache Spark 2.2.0 is the third release of the Spark 2.x line. This release 
> removes the experimental tag from Structured Streaming. In addition, this 
> release focuses on usability, stability, and polish, resolving over 1100 
> tickets.
> 
> We'd like to thank our contributors and users for their contributions and 
> early feedback to this release. This release would not have been possible 
> without you.
> 
> To download Spark 2.2.0, head over to the download page: 
> http://spark.apache.org/downloads.html
> 
> To view the release notes: 
> https://spark.apache.org/releases/spark-release-2-2-0.html
> 
> (note: If you see any issues with the release notes, webpage or published 
> artifacts, please contact me directly off-list) 
> 
> Michael


Re: "Sharing" dataframes...

2017-06-21 Thread Jean Georges Perrin
I have looked at Livy in the (very recent) past and it will not do the trick for 
me. It seems pretty greedy in terms of resources (or at least that was our 
experience). I will investigate how job-server could do the trick.

(on a side note I tried to find a paper on memory lifecycle within Spark but 
was not very successful, maybe someone has a link to spare.)

My need is to keep one/several dataframes in memory (well, within Spark) so 
it/they can be reused at a later time, without persisting it/them to disk 
(unless Spark wants to, of course).
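
A minimal sketch of the keep-it-in-one-application approach (Spark 2.1+ Java API; 
the input path and view name are made up). This only works while programs A, B, 
and C run as jobs inside the same Spark application -- sharing across separate 
applications is exactly what job-server/Livy/Ignite/Alluxio address:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.storage.StorageLevel;

    Dataset<Row> resultA = spark.read().parquet("/data/a");   // hypothetical input
    resultA.persist(StorageLevel.MEMORY_ONLY());  // lazily kept in executor memory
    resultA.createGlobalTempView("result_a");     // visible to all sessions of this app
    // Later, from another SparkSession of the same application:
    Dataset<Row> combined = spark.sql("SELECT * FROM global_temp.result_a");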



> On Jun 21, 2017, at 10:47 AM, Michael Mior <mm...@uwaterloo.ca> wrote:
> 
> This is a puzzling suggestion to me. It's unclear what features the OP needs, 
> so it's really hard to say whether Livy or job-server aren't sufficient. It's 
> true that neither are particularly mature, but they're much more mature than 
> a homemade project which hasn't started yet.
> 
> That said, I'm not very familiar with either project, so perhaps there are 
> some big concerns I'm not aware of.
> 
> --
> Michael Mior
> mm...@apache.org <mailto:mm...@apache.org>
> 
> 2017-06-21 3:19 GMT-04:00 Rick Moritz <rah...@gmail.com 
> <mailto:rah...@gmail.com>>:
> Keeping it inside the same program/SparkContext is the most performant 
> solution, since you can avoid serialization and deserialization. 
> In-Memory-Persistance between jobs involves a memcopy, uses a lot of RAM and 
> invokes serialization and deserialization. Technologies that can help you do 
> that easily are Ignite (as mentioned) but also Alluxio, Cassandra with 
> in-memory tables and a memory-backed HDFS-directory (see tiered storage).
> Although livy and job-server provide the functionality of providing a single 
> SparkContext to multiple programs, I would recommend you build your own 
> framework for integrating different jobs, since many features you may need 
> aren't present yet, while others may cause issues due to lack of maturity. 
> Artificially splitting jobs is in general a bad idea, since it breaks the DAG 
> and thus prevents some potential push-down optimizations.
> 
> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Thanks Vadim & Jörn... I will look into those.
> 
> jg
> 
>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com 
>> <mailto:vadim.seme...@datadoghq.com>> wrote:
>> 
>> You can launch one permanent spark context and then execute your jobs within 
>> the context. And since they'll be running in the same context, they can 
>> share data easily.
>> 
>> These two projects provide the functionality that you need:
>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>>  
>> <https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs>
>> https://github.com/cloudera/livy#post-sessions 
>> <https://github.com/cloudera/livy#post-sessions>
>> 
>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> Hey,
>> 
>> Here is my need: program A does something on a set of data and produces 
>> results, program B does that on another set, and finally, program C combines 
>> the data of A and B. Of course, the easy way is to dump all on disk after A 
>> and B are done, but I wanted to avoid this.
>> 
>> I was thinking of creating a temp view, but I do not really like the temp 
>> aspect of it ;). Any idea (they are all worth sharing)
>> 
>> jg
>> 
>> 
>> 
>> 
>> 
> 
> 
> 



Re: org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
After investigation, it looks like my Spark 2.1.1 jars got corrupted during 
download - all good now... ;)

> On Jun 20, 2017, at 4:14 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Hey all,
> 
> i was giving a run to 2.1.1 and got an error on one of my test program: 
> 
> package net.jgp.labs.spark.l000_ingestion;
> 
> import java.util.Arrays;
> import java.util.List;
> 
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.types.IntegerType;
> 
> public class ArrayToDataset {
> 
>   public static void main(String[] args) {
>   ArrayToDataset app = new ArrayToDataset();
>   app.start();
>   }
> 
>   private void start() {
>   SparkSession spark = SparkSession.builder().appName("Array to 
> Dataset").master("local").getOrCreate();
> 
>   Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
>   List<Integer> data = Arrays.asList(l);
>   Dataset<Row> df = spark.createDataFrame(data, 
> IntegerType.class);
> 
>   df.show();
>   }
> }
> 
> Eclipse is complaining that it cannot find 
> org.apache.spark.sql.types.IntegerType and after looking in the 
> spark-sql_2.11-2.1.1.jar jar, I could not find it as well:
> 
> 
> I looked at the 2.1.1 release notes as well, did not see anything. The 
> package is still in Javadoc: 
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html
>  
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html>
> 
> I must be missing something. Any hint?
> 
> Thanks!
> 
> jg
> 
> 
> 
> 
> 



Re: "Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Thanks Vadim & Jörn... I will look into those.

jg

> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com> 
> wrote:
> 
> You can launch one permanent spark context and then execute your jobs within 
> the context. And since they'll be running in the same context, they can share 
> data easily.
> 
> These two projects provide the functionality that you need:
> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>  
> <https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs>
> https://github.com/cloudera/livy#post-sessions 
> <https://github.com/cloudera/livy#post-sessions>
> 
> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hey,
> 
> Here is my need: program A does something on a set of data and produces 
> results, program B does that on another set, and finally, program C combines 
> the data of A and B. Of course, the easy way is to dump all on disk after A 
> and B are done, but I wanted to avoid this.
> 
> I was thinking of creating a temp view, but I do not really like the temp 
> aspect of it ;). Any idea (they are all worth sharing)
> 
> jg
> 
> 
> 
> 
> 



org.apache.spark.sql.types missing from spark-sql_2.11-2.1.1.jar?

2017-06-20 Thread Jean Georges Perrin
Hey all,

i was giving a run to 2.1.1 and got an error on one of my test program: 

package net.jgp.labs.spark.l000_ingestion;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.IntegerType;

public class ArrayToDataset {

public static void main(String[] args) {
ArrayToDataset app = new ArrayToDataset();
app.start();
}

private void start() {
SparkSession spark = SparkSession.builder().appName("Array to 
Dataset").master("local").getOrCreate();

Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
List<Integer> data = Arrays.asList(l);
Dataset<Row> df = spark.createDataFrame(data, 
IntegerType.class);

df.show();
}
}

Eclipse is complaining that it cannot find 
org.apache.spark.sql.types.IntegerType and after looking in the 
spark-sql_2.11-2.1.1.jar jar, I could not find it as well:


I looked at the 2.1.1 release notes as well, did not see anything. The package 
is still in Javadoc: 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/package-summary.html
 


I must be missing something. Any hint?

Thanks!

jg







"Sharing" dataframes...

2017-06-20 Thread Jean Georges Perrin
Hey,

Here is my need: program A does something on a set of data and produces 
results, program B does that on another set, and finally, program C combines 
the data of A and B. Of course, the easy way is to dump all on disk after A and 
B are done, but I wanted to avoid this. 

I was thinking of creating a temp view, but I do not really like the temp 
aspect of it ;). Any idea (they are all worth sharing)

jg






Re: [Spark JDBC] Does spark support read from remote Hive server via JDBC

2017-06-07 Thread Jean Georges Perrin
Do you have some other security in place like Kerberos or impersonation? It may 
affect your access. 

jg


> On Jun 7, 2017, at 02:15, Patrik Medvedev  wrote:
> 
> Hello guys,
> 
> I need to execute hive queries on remote hive server from spark, but for some 
> reasons i receive only column names(without data).
> Data available in table, i checked it via HUE and java jdbc connection.
> 
> Here is my code example:
> val test = spark.read
> .option("url", "jdbc:hive2://remote.hive.server:1/work_base")
> .option("user", "user")
> .option("password", "password")
> .option("dbtable", "some_table_with_data")
> .option("driver", "org.apache.hive.jdbc.HiveDriver")
> .format("jdbc")
> .load()
> test.show()
> 
> 
> Scala version: 2.11
> Spark version: 2.1.0, i also tried 2.1.1
> Hive version: CDH 5.7 Hive 1.1.1
> Hive JDBC version: 1.1.1
> 
> But this problem available on Hive with later versions, too.
> Could you help me with this issue, because i didn't find anything in mail 
> group answers and StackOverflow.
> Or could you help me find correct solution how to query remote hive from 
> spark?
> 
> -- 
> Cheers,
> Patrick


Re: checkpoint

2017-04-14 Thread Jean Georges Perrin
Sorry - can't help with PySpark, but here is some Java code which you may be 
able to transform to Python? 
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/

jg


> On Apr 14, 2017, at 07:18, issues solution  wrote:
> 
> Hi 
> somone can give me an complete example to work with chekpoint under Pyspark 
> 1.6 ?
> 
> thx
> regards


Re: Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin
Thanks Jörn,

I tried to super simplify my project so I can focus on the plumbing and I will 
add the existing code & library later. So, as of now, the project will not have 
a lot of meaning but will allow me to understand the job.

my call is:

String filename = "src/test/resources/simple.json";
SparkSession spark = 
SparkSession.builder().appName("X-parse").master("local").getOrCreate();
Dataset<Row> df = spark.read().format("x.CharCounterDataSource")
.option("char", "a") // count the number of 'a'
.load(filename); // local file (line 40 in the stacks below)
df.show();

Ideally, this should display something like:

+--+
| a|
+--+
|45|
+--+

Things gets trickier when I try to work on x.CharCounterDataSource:

I looked at 2 ways to do it:

1) one based on FileFormat:

public class CharCounterDataSource implements FileFormat {

@Override
public Function1<PartitionedFile, Iterator<InternalRow>> 
buildReader(SparkSession arg0, StructType arg1,
StructType arg2, StructType arg3, Seq<Filter> arg4, 
Map<String, String> arg5, Configuration arg6) {
// TODO Auto-generated method stub
return null;
}

@Override
public Function1<PartitionedFile, Iterator<InternalRow>> 
buildReaderWithPartitionValues(SparkSession arg0,
StructType arg1, StructType arg2, StructType arg3, 
Seq<Filter> arg4, Map<String, String> arg5,
Configuration arg6) {
// TODO Auto-generated method stub
return null;
}

@Override
public Option<StructType> inferSchema(SparkSession arg0, Map<String, 
String> arg1, Seq<FileStatus> arg2) {
// TODO Auto-generated method stub
return null;
}

@Override
public boolean isSplitable(SparkSession arg0, Map<String, String> arg1, 
Path arg2) {
// TODO Auto-generated method stub
return false;
}

@Override
public OutputWriterFactory prepareWrite(SparkSession arg0, Job arg1, 
Map<String, String> arg2, StructType arg3) {
// TODO Auto-generated method stub
return null;
}

@Override
public boolean supportBatch(SparkSession arg0, StructType arg1) {
// TODO Auto-generated method stub
return false;
}
}

I know it is an empty class (generated by Eclipse) and I am not expecting much 
out of it.

Running it says:

java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.DataSource.org$apache$spark$sql$execution$datasources$DataSource$$getOrInferFileFormatSchema(DataSource.scala:188)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at 
x.spark.datasource.counter.CharCounterDataSourceTest.test(CharCounterDataSourceTest.java:40)

Nothing surprising...

2) One based on RelationProvider:

public class CharCounterDataSource implements RelationProvider {

@Override
public BaseRelation createRelation(SQLContext arg0, Map<String, String> 
arg1) {
// TODO Auto-generated method stub
return null;
}

}

which fails too...

java.lang.NullPointerException
at 
org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:40)
at 
org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:389)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:146)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
at x.CharCounterDataSourceTest.test(CharCounterDataSourceTest.java:40)


Don't get me wrong - I understand it fails - but what I need is "just one hint" 
to continue building the glue ;-)...

(Un)fortunately, we cannot use Scala...

jg
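
For the "just one hint": the NullPointerException comes from createRelation() 
returning null. A hedged, minimal sketch (Spark 2.x-era API, illustrative names 
and dummy data) of a RelationProvider that hands back a relation with a schema 
and a TableScan:

    import java.util.Arrays;

    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.rdd.RDD;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SQLContext;
    import org.apache.spark.sql.sources.BaseRelation;
    import org.apache.spark.sql.sources.RelationProvider;
    import org.apache.spark.sql.sources.TableScan;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    public class CharCounterDataSource implements RelationProvider {

        @Override
        public BaseRelation createRelation(SQLContext sqlContext,
                scala.collection.immutable.Map<String, String> parameters) {
            // Returning a real relation (instead of null) is the missing glue.
            return new CharCounterRelation(sqlContext);
        }

        static class CharCounterRelation extends BaseRelation implements TableScan {
            private final SQLContext sqlContext;

            CharCounterRelation(SQLContext sqlContext) {
                this.sqlContext = sqlContext;
            }

            @Override
            public SQLContext sqlContext() {
                return sqlContext;
            }

            @Override
            public StructType schema() {
                // One integer column named "a" -- replace with the real schema.
                return new StructType().add("a", DataTypes.IntegerType);
            }

            @Override
            public RDD<Row> buildScan() {
                // Dummy data; this is where the existing Java parser would be called.
                JavaSparkContext jsc =
                        new JavaSparkContext(sqlContext.sparkContext());
                return jsc.parallelize(Arrays.asList(RowFactory.create(45))).rdd();
            }
        }
    }

With something like that in place, spark.read().format("x.CharCounterDataSource") 
.option("char", "a").load(filename) should come back as a one-column DataFrame; 
the options (and typically the path, under a "path" key) arrive in the parameters 
map of createRelation().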

> On Mar 22, 2017, at 4:00 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> 
> I think you can develop a Spark data source in Java, but you are right: most 
> use Scala for the glue in Spark even if they have a Java library (this is what 
> I did for the project I open sourced). Coming back to your question, it is a little 
> bit difficult to assess the exact issue without the code.
> You could also try to first have a very simple Scala data source that works 
> and then translate it to Java and do the test there. You could then also post 
> the code here without disclosing confidential stuff.
> Or you try directly in Java a data source that returns always a row with one 
> column containing a String. I fear in any case you need to import some Scala 
> classes in Java and/or have some wrappers in Scala.
> 

Custom Spark data source in Java

2017-03-22 Thread Jean Georges Perrin

Hi,

I am trying to build a custom file data source for Spark, in Java. I have found 
numerous examples in Scala (including the CSV and XML data sources from 
Databricks), but I cannot bring Scala in this project. We also already have the 
parser itself written in Java, I just need to build the "glue" between the 
parser and Spark.

This is how I'd like to call it:

String filename = "src/test/resources/simple.x";

SparkSession spark = 
SparkSession.builder().appName("X-parse").master("local").getOrCreate();

Dataset<Row> df = spark.read().format("x.RandomDataSource")
.option("metadataTag", "schema") // hint to find schema
.option("dataTag", "data") // hint to find data
.load(filename); // local file
So far, I tried is implement x.RandomDataSource:

• Based on FileFormat, which makes the most sense, but I do not have a 
clue on how to build buildReader()...
• Based on RelationProvider, but same here...

It seems that in both cases, the call is made to the right class, but I get an 
NPE because I do not provide much. Any hint or example would be greatly 
appreciated!

Thanks

jg

Re: eager? in dataframe's checkpoint

2017-02-02 Thread Jean Georges Perrin
i wrote this piece based on all that, hopefully it will help:
http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/ 
<http://jgp.net/2017/02/02/what-are-spark-checkpoints-on-dataframes/>
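
A minimal sketch of both flavors (Spark 2.1+ Java API; the checkpoint directory 
is made up):

    spark.sparkContext().setCheckpointDir("/tmp/spark-checkpoints"); // assumed dir
    // eager = true: the plan is executed now, the files are written, and the
    // returned DataFrame's lineage starts at those files.
    Dataset<Row> eagerDf = df.checkpoint(true);
    // eager = false: the checkpoint is only materialized when an action runs.
    Dataset<Row> lazyDf = df.checkpoint(false);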

> On Jan 31, 2017, at 4:18 PM, Burak Yavuz <brk...@gmail.com> wrote:
> 
> Hi Koert,
> 
> When eager is true, we return you a new DataFrame that depends on the files 
> written out to the checkpoint directory.
> All previous operations on the checkpointed DataFrame are gone forever. You 
> basically start fresh. AFAIK, when eager is true, the method will not return 
> until the DataFrame is completely checkpointed. If you look at the 
> RDD.checkpoint implementation, the checkpoint location is updated 
> synchronously therefore during the count, `isCheckpointed` will be true.
> 
> Best,
> Burak
> 
> On Tue, Jan 31, 2017 at 12:52 PM, Koert Kuipers <ko...@tresata.com 
> <mailto:ko...@tresata.com>> wrote:
> i understand that checkpoint cuts the lineage, but i am not fully sure i 
> understand the role of eager. 
> 
> eager simply seems to materialize the rdd early with a count, right after the 
> rdd has been checkpointed. but why is that useful? rdd.checkpoint is 
> asynchronous, so when the rdd.count happens most likely rdd.isCheckpointed 
> will be false, and the count will be on the rdd before it was checkpointed. 
> what is the benefit of that?
> 
> 
> On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz <brk...@gmail.com 
> <mailto:brk...@gmail.com>> wrote:
> Hi,
> 
> One of the goals of checkpointing is to cut the RDD lineage. Otherwise you 
> run into StackOverflowExceptions. If you eagerly checkpoint, you basically 
> cut the lineage there, and the next operations all depend on the checkpointed 
> DataFrame. If you don't checkpoint, you continue to build the lineage, 
> therefore while that lineage is being resolved, you may hit the 
> StackOverflowException.
> 
> HTH,
> Burak
> 
> On Thu, Jan 26, 2017 at 10:36 AM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hey Sparkers,
> 
> Trying to understand the Dataframe's checkpoint (not in the context of 
> streaming) 
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
>  
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)>
> 
> What is the goal of the eager flag?
> 
> Thanks!
> 
> jg
> 
> 
> 



Spark 2 + Java + UDF + unknown return type...

2017-02-02 Thread Jean Georges Perrin
Hi fellow Sparkans,

I am building a UDF (in Java) that can return various data types, basically the 
signature of the function itself is:

public Object call(String a, Object b, String c, Object d, String e) 
throws Exception

When I register my function, I need to provide a type, e.g.:

spark.udf().register("f2", new Udf5(), DataTypes.LongType);

In my test it is a long now, but can become a string or a float. Of course, I 
do not know the expected return type before I call the function, which I call 
like:

df = df.selectExpr("*", "f2('x1', x, 'c2', y, 'op') as op");

Is there a way to have an Object being returned from a UDF and to store an 
Object in a Dataset/dataframe? I don't need to know the datatype at that point 
and can leave it hanging for now? Or should I play it safe and always return a 
DataTypes.StringType (and then try to transform it if needed)?

I hope I am clear enough :).

Thanks for any tip/idea/comment...

jg
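
A hedged sketch of the "play it safe" option from the last paragraph -- register 
with StringType and cast at call time (Udf5 is the class mentioned above; the 
target type here is just an example):

    spark.udf().register("f2", new Udf5(), DataTypes.StringType);
    // The UDF always returns a String; the expression casts it to whatever the
    // caller expects at that point (a bigint in this example).
    df = df.selectExpr("*", "CAST(f2('x1', x, 'c2', y, 'op') AS bigint) AS op");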

eager? in dataframe's checkpoint

2017-01-26 Thread Jean Georges Perrin
Hey Sparkers,

Trying to understand the Dataframe's checkpoint (not in the context of 
streaming) 
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/Dataset.html#checkpoint(boolean)
 


What is the goal of the eager flag?

Thanks!

jg

Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Thanks Luciano - I think this is my issue :(

> On Oct 10, 2016, at 2:08 PM, Luciano Resende <luckbr1...@gmail.com> wrote:
> 
> Please take a look at 
> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets 
> <http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets>
> 
> Particularly the note at the required format :
> 
> Note that the file that is offered as a json file is not a typical JSON file. 
> Each line must contain a separate, self-contained valid JSON object. As a 
> consequence, a regular multi-line JSON file will most often fail.
> 
> 
> 
> On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi folks,
> 
> I am trying to parse JSON arrays and it’s getting a little crazy (for me at 
> least)…
> 
> 1)
> If my JSON is:
> {"vals":[100,500,600,700,800,200,900,300]}
> 
> I get:
> ++
> |vals|
> ++
> |[100, 500, 600, 7...|
> ++
> 
> root
>  |-- vals: array (nullable = true)
>  ||-- element: long (containsNull = true)
> 
> and I am :)
> 
> 2)
> If my JSON is:
> [100,500,600,700,800,200,900,300]
> 
> I get:
> ++
> | _corrupt_record|
> ++
> |[100,500,600,700,...|
> ++
> 
> root
>  |-- _corrupt_record: string (nullable = true)
> 
> Both are legit JSON structures… Do you think that #2 is a bug?
> 
> jg
> 
> 
> 
> 
> 
> 
> 
> 
> -- 
> Luciano Resende
> http://twitter.com/lresende1975 <http://twitter.com/lresende1975>
> http://lresende.blogspot.com/ <http://lresende.blogspot.com/>


Re: JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Thanks!

I am ok with strict rules (despite being French), but even:
[{
"red": "#f00", 
"green": "#0f0"
},{
"red": "#f01", 
"green": "#0f1"
}]

is not going through…

Is there a way to see what it does not like?

the JSON parser has been pretty good to me until recently.
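
For the record, a hedged workaround sketch: the JSON source expects one 
self-contained object per line (JSON Lines). Spark 2.2+ adds a multiLine option 
that also accepts a pretty-printed document or a top-level array of objects 
(the file name below is illustrative):

    // Spark 2.2+ (Java API): parse the whole file as a single JSON document.
    Dataset<Row> colors = spark.read()
            .option("multiLine", true)
            .json("colors.json");
    colors.show();

On older versions, the usual trick is to flatten the file to one object per line 
before handing it to the JSON reader.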


> On Oct 10, 2016, at 12:59 PM, Sudhanshu Janghel <> wrote:
> 
> As far as my experience goes, Spark can parse only certain types of JSON 
> correctly, not all, and it has strict parsing rules, unlike Python
> 
> 
> On Oct 10, 2016 6:57 PM, "Jean Georges Perrin" <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi folks,
> 
> I am trying to parse JSON arrays and it’s getting a little crazy (for me at 
> least)…
> 
> 1)
> If my JSON is:
> {"vals":[100,500,600,700,800,200,900,300]}
> 
> I get:
> ++
> |vals|
> ++
> |[100, 500, 600, 7...|
> ++
> 
> root
>  |-- vals: array (nullable = true)
>  ||-- element: long (containsNull = true)
> 
> and I am :)
> 
> 2)
> If my JSON is:
> [100,500,600,700,800,200,900,300]
> 
> I get:
> ++
> | _corrupt_record|
> ++
> |[100,500,600,700,...|
> ++
> 
> root
>  |-- _corrupt_record: string (nullable = true)
> 
> Both are legit JSON structures… Do you think that #2 is a bug?
> 
> jg
> 
> 
> 
> 
> 
> 



JSON Arrays and Spark

2016-10-10 Thread Jean Georges Perrin
Hi folks,

I am trying to parse JSON arrays and it’s getting a little crazy (for me at 
least)…

1)
If my JSON is:
{"vals":[100,500,600,700,800,200,900,300]}

I get:
++
|vals|
++
|[100, 500, 600, 7...|
++

root
 |-- vals: array (nullable = true)
 ||-- element: long (containsNull = true)

and I am :)

2)
If my JSON is:
[100,500,600,700,800,200,900,300]

I get:
++
| _corrupt_record|
++
|[100,500,600,700,...|
++

root
 |-- _corrupt_record: string (nullable = true)

Both are legit JSON structures… Do you think that #2 is a bug?

jg







Re: Inserting New Primary Keys

2016-10-10 Thread Jean Georges Perrin
Is there only one process adding rows? Because this seems a little risky if you 
have multiple threads doing that… 

> On Oct 8, 2016, at 1:43 PM, Benjamin Kim  wrote:
> 
> Mich,
> 
> After much searching, I found and am trying to use “SELECT ROW_NUMBER() 
> OVER() + b.id_max AS id, a.* FROM source a CROSS JOIN (SELECT 
> COALESCE(MAX(id),0) AS id_max FROM tmp_destination) b”. I think this should 
> do it.
> 
> Thanks,
> Ben
> 
> 
>> On Oct 8, 2016, at 9:48 AM, Mich Talebzadeh > > wrote:
>> 
>> can you get the max value from the current  table and start from MAX(ID) + 1 
>> assuming it is a numeric value (it should be)?
>> 
>> HTH
>> 
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 8 October 2016 at 17:42, Benjamin Kim > > wrote:
>> I have a table with data already in it that has primary keys generated by 
>> the function monotonicallyIncreasingId. Now, I want to insert more data into 
>> it with primary keys that will auto-increment from where the existing data 
>> left off. How would I do this? There is no argument I can pass into the 
>> function monotonicallyIncreasingId to seed it.
>> 
>> Thanks,
>> Ben
>> 
>> 
>> 
>> 
>> 
> 



Raleigh, Durham, and around...

2016-08-04 Thread Jean Georges Perrin
Hi,

With some friends, we are trying to develop the Apache Spark community in the 
Triangle area of North Carolina, USA. If you are from there, feel free to join 
our Slack team: http://oplo.io/td. Danny Siegle has also organized a lot of 
meetups around the edX courses (see 
https://www.meetup.com/Triangle-Apache-Spark-Meetup/ 
).

Sorry for the folks not in NC, but you should come over, it's a great place to 
live :)

jg




Re: Java Recipes for Spark

2016-07-31 Thread Jean Georges Perrin
Thanks Guys - I really appreciate :)... If you have any idea of something 
missing, I'll gladly add it.

(and yeah, come on! Is that some kind of primitive racism or what: Java rocks! 
What are those languages where you can turn a list into a string and back into an 
object? #StrongTypingRules)

> On Jul 30, 2016, at 12:19 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
> 
> +1 for the Java love :-)
> 
> 
> On 30-Jul-2016 4:39 AM, "Renato Perini" <renato.per...@gmail.com 
> <mailto:renato.per...@gmail.com>> wrote:
> Not only very useful, but finally some Java love :-)
> 
> Thank you.
> 
> 
> Il 29/07/2016 22:30, Jean Georges Perrin ha scritto:
> Sorry if this looks like a shameless self promotion, but some of you asked me 
> to say when I'll have my Java recipes for Apache Spark updated. It's done 
> here: http://jgp.net/2016/07/22/spark-java-recipes/ 
> <http://jgp.net/2016/07/22/spark-java-recipes/> and in the GitHub repo.
> 
> Enjoy / have a great week-end.
> 
> jg
> 
> 
> 
> 
> 



Java Recipes for Spark

2016-07-29 Thread Jean Georges Perrin
Sorry if this looks like a shameless self promotion, but some of you asked me 
to say when I'll have my Java recipes for Apache Spark updated. It's done here: 
http://jgp.net/2016/07/22/spark-java-recipes/ 
 and in the GitHub repo. 

Enjoy / have a great week-end.

jg




java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Jean Georges Perrin
I am trying to build a simple DataFrame that can be used for ML


SparkConf conf = new SparkConf().setAppName("Simple prediction 
from Text File").setMaster("local");
SparkContext sc = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

sqlContext.udf().register("vectorBuilder", new VectorBuilder(), 
new VectorUDT());

String filename = "data/tuple-data-file.csv";
StructType schema = new StructType(
new StructField[] { new StructField("C0", 
DataTypes.StringType, false, Metadata.empty()),
new StructField("C1", 
DataTypes.IntegerType, false, Metadata.empty()),
new StructField("features", new 
VectorUDT(), false, Metadata.empty()), });

DataFrame df = 
sqlContext.read().format("com.databricks.spark.csv").schema(schema).option("header",
 "false")
.load(filename);
df = df.withColumn("label", df.col("C0")).drop("C0");
df = df.withColumn("value", df.col("C1")).drop("C1");
df.printSchema();
Returns:
root
 |-- features: vector (nullable = false)
 |-- label: string (nullable = false)
 |-- value: integer (nullable = false)
df.show();
Returns:

java.lang.RuntimeException: Unsupported type: vector
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/07/24 12:46:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
localhost): java.lang.RuntimeException: Unsupported type: vector
at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
at 
com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at 

UDF to build a Vector?

2016-07-24 Thread Jean Georges Perrin

Hi,

Here is my UDF that should build a vector column (VectorUDT). How do I actually 
get the value into the vector?

package net.jgp.labs.spark.udf;

import org.apache.spark.mllib.linalg.VectorUDT;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilder implements UDF1<Integer, VectorUDT> {
private static final long serialVersionUID = -2991355883253063841L;

@Override
public VectorUDT call(Integer t1) throws Exception {
return new VectorUDT();
}

}

I plan on having this used by a linear regression in ML...
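
A hedged sketch of the usual fix: the UDF has to return Vector values (the data), 
while VectorUDT only describes the column type at registration time:

    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.sql.api.java.UDF1;

    public class VectorBuilder implements UDF1<Integer, Vector> {
        private static final long serialVersionUID = -2991355883253063841L;

        @Override
        public Vector call(Integer t1) throws Exception {
            // Wrap the integer into a one-element dense vector.
            return Vectors.dense(t1.doubleValue());
        }
    }

    // Registration is unchanged: the declared Spark SQL type stays a VectorUDT.
    // sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new VectorUDT());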

Re: Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin
You're right, it's the same behavior... Oh well... I wanted something easy :(
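
Spelled out, the RowFactory approach described in the reply below looks roughly 
like this (Spark 1.6-era Java API):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    List<Row> rows = new ArrayList<>();
    for (int i : new int[] { 1, 2, 3, 4, 5, 6, 7 }) {
        rows.add(RowFactory.create(i));
    }
    StructType schema = DataTypes.createStructType(Collections.singletonList(
            DataTypes.createStructField("value", DataTypes.IntegerType, false)));
    DataFrame df = sqlContext.createDataFrame(rows, schema);
    df.show();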

> On Jul 22, 2016, at 12:41 PM, Everett Anderson <ever...@nuna.com.invalid 
> <mailto:ever...@nuna.com.invalid>> wrote:
> 
> Actually, sorry, my mistake, you're calling
> 
>   DataFrame df = sqlContext.createDataFrame(data, 
> org.apache.spark.sql.types.NumericType.class);
> 
> and giving it a list of objects which aren't NumericTypes, but the wildcards 
> in the signature let it happen.
> 
> I'm curious what'd happen if you gave it Integer.class, but I suspect it 
> still won't work because Integer may not have the bean-style getters.
> 
> 
> On Fri, Jul 22, 2016 at 9:37 AM, Everett Anderson <ever...@nuna.com 
> <mailto:ever...@nuna.com>> wrote:
> Hey,
> 
> I think what's happening is that you're calling this createDataFrame method 
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html#createDataFrame(java.util.List,%20java.lang.Class)>:
> 
> createDataFrame(java.util.List<?> data, java.lang.Class<?> beanClass) 
> 
> which expects a JavaBean-style class with get and set methods for the 
> members, but Integer doesn't have such a getter. 
> 
> I bet there's an easier way if you just want a single-column DataFrame of a 
> primitive type, but one way that would work is to manually construct the Rows 
> using RowFactory.create() 
> <https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/RowFactory.html#create(java.lang.Object...)>
>  and assemble the DataFrame from that like
> 
> List<Row> rows = convert your List<Integer> to this in a loop with 
> RowFactory.create()
> 
> StructType schema = DataTypes.createStructType(Collections.singletonList(
>  DataTypes.createStructField("int_field", DataTypes.IntegerType, true)));
> 
> DataFrame intDataFrame = sqlContext.createDataFrame(rows, schema);
> 
> 
> 
> On Fri, Jul 22, 2016 at 7:53 AM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> 
> 
> I am trying to build a DataFrame from a list, here is the code:
> 
>   private void start() {
>   SparkConf conf = new SparkConf().setAppName("Data Set from 
> Array").setMaster("local");
>   SparkContext sc = new SparkContext(conf);
>   SQLContext sqlContext = new SQLContext(sc);
> 
>   Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
>   List<Integer> data = Arrays.asList(l);
> 
>   System.out.println(data);
>   
>   DataFrame df = sqlContext.createDataFrame(data, 
> org.apache.spark.sql.types.NumericType.class);
>   df.show();
>   }
> 
> My result is (unpleasantly):
> 
> [1, 2, 3, 4, 5, 6, 7]
> ++
> ||
> ++
> ||
> ||
> ||
> ||
> ||
> ||
> ||
> ++
> 
> I also tried with:
> org.apache.spark.sql.types.NumericType.class
> org.apache.spark.sql.types.IntegerType.class
> org.apache.spark.sql.types.ArrayType.class
> 
> I am probably missing something super obvious :(
> 
> Thanks!
> 
> jg
> 
> 
> 
> 



Creating a DataFrame from scratch

2016-07-22 Thread Jean Georges Perrin


I am trying to build a DataFrame from a list, here is the code:

private void start() {
SparkConf conf = new SparkConf().setAppName("Data Set from 
Array").setMaster("local");
SparkContext sc = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);

Integer[] l = new Integer[] { 1, 2, 3, 4, 5, 6, 7 };
List<Integer> data = Arrays.asList(l);

System.out.println(data);

DataFrame df = sqlContext.createDataFrame(data, 
org.apache.spark.sql.types.NumericType.class);
df.show();
}

My result is (unpleasantly):

[1, 2, 3, 4, 5, 6, 7]
++
||
++
||
||
||
||
||
||
||
++

I also tried with:
org.apache.spark.sql.types.NumericType.class
org.apache.spark.sql.types.IntegerType.class
org.apache.spark.sql.types.ArrayType.class

I am probably missing something super obvious :(

Thanks!

jg




Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Thanks Marco - I like the idea of sticking with DataFrames ;)


> On Jul 22, 2016, at 7:07 AM, Marco Mistroni <mmistr...@gmail.com> wrote:
> 
> Hello Jean
> you can take your current DataFrame and send it to mllib (I was doing that 
> because I didn't know the ml package), but the process is a little bit cumbersome
> 
> 
> 1. go from a DataFrame to an RDD of LabeledPoint
> 2. run your ML model
> 
> i'd suggest you stick to DataFrame + ml package :)
> 
> hth
> 
> 
> 
> On Fri, Jul 22, 2016 at 4:41 AM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi,
> 
> I am looking for some really super basic examples of MLlib (like a linear 
> regression over a list of values) in Java. I have found a few, but I only saw 
> them using JavaRDD... and not DataFrame.
> 
> I was kind of hoping to take my current DataFrame and send them in MLlib. Am 
> I too optimistic? Do you know/have any example like that?
> 
> Thanks!
> 
> jg
> 
> 
> Jean Georges Perrin
> j...@jgp.net <mailto:j...@jgp.net> / @jgperrin
> 
> 
> 
> 
> 



Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Thanks Bryan - I keep forgetting about the examples... This is almost it :) I 
can work with that :)
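
Condensed from that example, a hedged sketch of a linear regression directly on a 
DataFrame with the ml (not mllib) API -- the column names are assumptions, and the 
DataFrame needs a vector "features" column plus a numeric label column:

    import org.apache.spark.ml.regression.LinearRegression;
    import org.apache.spark.ml.regression.LinearRegressionModel;

    LinearRegression lr = new LinearRegression()
            .setFeaturesCol("features")   // vector column (assumed name)
            .setLabelCol("label")         // numeric column (assumed name)
            .setMaxIter(10);
    LinearRegressionModel model = lr.fit(df);
    System.out.println("intercept = " + model.intercept()
            + ", coefficients = " + model.coefficients());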


> On Jul 22, 2016, at 1:39 AM, Bryan Cutler <cutl...@gmail.com> wrote:
> 
> Hi JG,
> 
> If you didn't know this, Spark MLlib has 2 APIs, one of which uses 
> DataFrames.  Take a look at this example 
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java
>  
> <https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaLinearRegressionWithElasticNetExample.java>
> 
> This example uses a Dataset, which is type equivalent to a DataFrame.
> 
> 
> On Thu, Jul 21, 2016 at 8:41 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi,
> 
> I am looking for some really super basic examples of MLlib (like a linear 
> regression over a list of values) in Java. I have found a few, but I only saw 
> them using JavaRDD... and not DataFrame.
> 
> I was kind of hoping to take my current DataFrame and send them in MLlib. Am 
> I too optimistic? Do you know/have any example like that?
> 
> Thanks!
> 
> jg
> 
> 
> Jean Georges Perrin
> j...@jgp.net <mailto:j...@jgp.net> / @jgperrin
> 
> 
> 
> 
> 



Re: MLlib, Java, and DataFrame

2016-07-22 Thread Jean Georges Perrin
Hi Jules,

Thanks but not really: I know what DataFrames are and I actually use them - 
especially as RDDs will slowly fade away. A lot of the examples I see focus on 
cleaning / prepping the data, which is an important part, but not really on 
"after"... Sorry if I am not completely clear.

> On Jul 22, 2016, at 1:08 AM, Jules Damji <ju...@databricks.com> wrote:
> 
> Is this what you had in mind?
> 
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>  
> <https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html>
> 
> Cheers 
> Jules 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)
> 
> 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)
> On Jul 21, 2016, at 8:41 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> 
>> Hi,
>> 
>> I am looking for some really super basic examples of MLlib (like a linear 
>> regression over a list of values) in Java. I have found a few, but I only 
>> saw them using JavaRDD... and not DataFrame.
>> 
>> I was kind of hoping to take my current DataFrame and send them in MLlib. Am 
>> I too optimistic? Do you know/have any example like that?
>> 
>> Thanks!
>> 
>> jg
>> 
>> 
>> Jean Georges Perrin
>> j...@jgp.net <mailto:j...@jgp.net> / @jgperrin
>> 
>> 
>> 
>> 



MLlib, Java, and DataFrame

2016-07-21 Thread Jean Georges Perrin
Hi,

I am looking for some really super basic examples of MLlib (like a linear 
regression over a list of values) in Java. I have found a few, but I only saw 
them using JavaRDD... and not DataFrame.

I was kind of hoping to take my current DataFrame and send them in MLlib. Am I 
too optimistic? Do you know/have any example like that?

Thanks!

jg


Jean Georges Perrin
j...@jgp.net <mailto:j...@jgp.net> / @jgperrin






Re: Understanding spark concepts cluster, master, slave, job, stage, worker, executor, task

2016-07-20 Thread Jean Georges Perrin
Hey,

I love when questions are numbered, it's easier :)

1) Yes (but I am not an expert)
2) You don't fully control that... One of my processes goes up to 8k tasks, so... 
(see the note after this list)
3) Yes, and if you have HT, it doubles. My servers have 12 cores, but with HT it 
makes 24.
4) From my understanding: the slave is the logical computational unit and the 
worker is really the one doing the job. 
5) Dunnoh
6) Dunnoh
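
Note on question 2: the per-stage task count generally comes from the number of 
input partitions and, after a shuffle, from spark.sql.shuffle.partitions (default 
200) or spark.default.parallelism, so it is configurable. A hedged sketch:

    SparkConf conf = new SparkConf()
            .setAppName("my app")
            .set("spark.default.parallelism", "16")       // RDD shuffles
            .set("spark.sql.shuffle.partitions", "16");   // DataFrame/SQL shuffles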

> On Jul 20, 2016, at 1:30 PM, Sachin Mittal  wrote:
> 
> Hi,
> I was able to build and run my spark application via spark submit.
> 
> I have understood some of the concepts by going through the resources at 
> https://spark.apache.org  but few doubts still 
> remain. I have few specific questions and would be glad if someone could 
> share some light on it.
> 
> So I submitted the application using spark.master=local[*] and I have an 
> 8-core PC.
> 
> - What I understand is that application is called as job. Since mine had two 
> stages it gets divided into 2 stages and each stage had number of tasks which 
> ran in parallel.
> Is this understanding correct.
> 
> - What I notice is that each stage is further divided into 262 tasks. Where 
> did this number 262 come from? Is it configurable? Would increasing 
> this number improve performance?
> 
> - Also I see that the tasks are run in parallel in set of 8. Is this because 
> I have a 8 core PC.
> 
> - What is the difference or relation between slave and worker. When I did 
> spark-submit did it start 8 slaves or worker threads?
> 
> - I see all worker threads running in one single JVM. Is this because I did 
> not start  slaves separately and connect it to a single master cluster 
> manager. If I had done that then each worker would have run in its own JVM.
> 
> - What is the relationship between worker and executor. Can a worker have 
> more than one executors? If yes then how do we configure that. Does all 
> executor run in the worker JVM and are independent threads.
> 
> I suppose that is all for now. Would appreciate any response.Will add 
> followup questions if any.
> 
> Thanks
> Sachin
> 
> 



Re: Spark streaming takes longer time to read json into dataframes

2016-07-15 Thread Jean Georges Perrin
Do you need it on disk or just push it to memory? Can you try to increase 
memory or # of cores (I know it sounds basic)
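
One concrete, hedged suggestion: a large part of read().json(rdd) time is usually 
schema inference, which scans the data once before actually parsing it. Supplying 
the schema up front inside foreachRDD skips that pass (Java flavor, field names 
made up):

    StructType schema = new StructType()
            .add("id", DataTypes.StringType)         // assumed field
            .add("payload", DataTypes.StringType);   // assumed field
    // stringRDD is the JavaRDD<String> of Kafka message values
    DataFrame df = sqlContext.read().schema(schema).json(stringRDD);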

> On Jul 15, 2016, at 11:43 PM, Diwakar Dhanuskodi 
>  wrote:
> 
> Hello, 
> 
> I have 400K json messages pulled from Kafka into spark streaming using 
> DirectStream approach. Size of 400K messages is around 5G.  Kafka topic is 
> single partitioned. I am using rdd.read.json(_._2) inside foreachRDD to 
> convert  rdd into dataframe. It takes almost 2.3 minutes to convert into 
> dataframe. 
> 
> I am running in Yarn client mode with executor memory as 15G and executor 
> cores as 2.
> 
> Caching rdd before converting into dataframe  doesn't change processing time. 
> Whether introducing hash partitions inside foreachRDD  will help? (or) Will 
> partitioning topic and have more than one DirectStream help?. How can I 
> approach this situation to reduce time in converting to dataframe.. 
> 
> Regards, 
> Diwakar. 





Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Hey Mich,


Oh well, you know, us humble programmers try to modestly understand what the 
brilliant data scientists are designing, and I can assure you that it is not 
easy.

Basically the way I use Spark is in 2 ways:

1) As a developer
I just embed the Spark binaries (jars) in my Maven POM. In the app, when I need 
to have Spark do something, I just call the local's master (quick example here: 
http://jgp.net/2016/06/26/your-very-first-apache-spark-application/ 
<http://jgp.net/2016/06/26/your-very-first-apache-spark-application/>).

Pro: this is the super-duper easy & lazy way, works like a charm, set up in under 
5 minutes with one arm behind your back and blindfolded.
Con: well, I have a MacBook Air, a nice MacBook Air, but still it is only a 
MacBook Air, with 8GB of RAM and 2 cores... My analysis never finished (but a 
subset did).

2) As a database
Ok, some will probably find that shocking, but I used Spark as a database on a 
distant computer (my sweet Micha). The app connects to Spark, tells it what to 
do, and the application "consumes" the data crunching done by Spark on Micha (a 
bit more on the architecture there: 
http://jgp.net/2016/07/14/chapel-hill-we-dont-have-a-problem/ 
<http://jgp.net/2016/07/14/chapel-hill-we-dont-have-a-problem/>). 

Pro: this can scale like crazy (I have benchmarks scheduled)
Con: well... after you went through all the issues I had, I don't see many 
issues anymore (except that I still can't set the # of executors -- which 
starts to make sense).

3) As a remote batch processor
You prepare your "batch" as a jar. I remember using mainframes this way (and 
using SAS). 

Pro: very friendly to data scientists / researchers as they are used to this 
batch model.
Con: you need to prepare the batch, send it... The jar also needs to do something 
with the results: save them in a database? send a mail? send a PDF? call the police?

Do you agree? Any other opinion?

I am not saying one is better than the other, just trying to get a "big 
picture".

jg




> On Jul 15, 2016, at 2:13 PM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Interesting
> 
> For some stuff I create an uber jar file and use that against spark-submit. I 
> have not attempted to start the cluster from through application.
> 
> 
> I tend to use a shell program (actually a k-shell) to compile it via maven or 
> sbt and then run it accordingly. In general you can parameterise everything 
> for runtime parameters say --driver-memory ${DRIVER_MEMORY} to practically 
> any other parameter . That way I find it more flexible because I can submit 
> the jar file and the class in any environment and adjust those runtime 
> parameters accordingly.  There are certain advantages to using spark-submit, 
> for example, since driver-memory setting encapsulates the JVM, you will need 
> to set the amount of driver memory for any non-default value before starting 
> JVM by providing the value in spark-submit.
> 
> I would be keen in hearing the pros and cons of the above approach. I am sure 
> you programmers (Scala/Java) know much more than me :)
> 
> Cheers
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 15 July 2016 at 16:42, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> lol - young padawan I am and path to knowledge seeking I am...
> 
> And on this path I also tried (without luck)...
> 
>   if (restId == 0) {
>   conf = conf.setExecutorEnv("spark.executor.cores", 
> "22");
>   } else {
>   conf = conf.setExecutorEnv("spark.executor.cores", "2");
>   }
> 
> and
> 
>   if (restId == 0) {
>   conf.setExecutorEnv("spark.executor.cores", "22");
>   } else {
>   conf.setExecutorEnv("spark.executor.cores", "2");
>   }
> 
> the only annoying thing I see is we designed some of the work to be handled 
> by the driver/client app and we will have to rethink a bit the design of the 
> app for that...
> 
> 

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
lol - young padawan I am and path to knowledge seeking I am...

And on this path I also tried (without luck)...

if (restId == 0) {
conf = conf.setExecutorEnv("spark.executor.cores", 
"22");
} else {
conf = conf.setExecutorEnv("spark.executor.cores", "2");
}

and

if (restId == 0) {
conf.setExecutorEnv("spark.executor.cores", "22");
} else {
conf.setExecutorEnv("spark.executor.cores", "2");
}

the only annoying thing I see is we designed some of the work to be handled by 
the driver/client app and we will have to rethink a bit the design of the app 
for that...
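
A hedged sketch of what usually does the trick on a standalone cluster: 
spark.cores.max caps the total number of cores one application grabs, and both it 
and spark.executor.cores have to go on the SparkConf with set() (not 
setExecutorEnv()) before the context is created:

    SparkConf conf = new SparkConf()
            .setAppName("NC Eatery app")
            .setMaster("spark://10.0.100.120:7077")
            .set("spark.executor.memory", "4g")
            .set("spark.cores.max", restId == 0 ? "22" : "2");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);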


> On Jul 15, 2016, at 11:34 AM, Daniel Darabos 
> <daniel.dara...@lynxanalytics.com> wrote:
> 
> Mich's invocation is for starting a Spark application against an already 
> running Spark standalone cluster. It will not start the cluster for you.
> 
> We used to not use "spark-submit", but we started using it when it solved 
> some problem for us. Perhaps that day has also come for you? :)
> 
> On Fri, Jul 15, 2016 at 5:14 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> I don't use submit: I start my standalone cluster and connect to it remotely. 
> Is that a bad practice?
> 
I'd like to be able to do it dynamically, as the system knows whether it needs 
more or fewer resources based on its own context
> 
>> On Jul 15, 2016, at 10:55 AM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> You can also do all this at env or submit time with spark-submit which I 
>> believe makes it more flexible than coding in.
>> 
>> Example
>> 
>> ${SPARK_HOME}/bin/spark-submit \
>> --packages com.databricks:spark-csv_2.11:1.3.0 \
>> --driver-memory 2G \
>> --num-executors 2 \
>> --executor-cores 3 \
>> --executor-memory 2G \
>> --master spark://50.140.197.217:7077 
>> <http://50.140.197.217:7077/> \
>> --conf "spark.scheduler.mode=FAIR" \
>> --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
>> -XX:+PrintGCTimeStamps" \
>> --jars 
>> /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>> --class "${FILE_NAME}" \
>> --conf "spark.ui.port=${SP}" \
>>  
>> HTH
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 15 July 2016 at 13:48, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> Merci Nihed, this is one of the tests I did :( still not working
>> 
>> 
>> 
>>> On Jul 15, 2016, at 8:41 AM, nihed mbarek <nihe...@gmail.com 
>>> <mailto:nihe...@gmail.com>> wrote:
>>> 
>>> can you try with : 
>>> SparkConf conf = new SparkConf().setAppName("NC Eatery 
>>> app").set("spark.executor.memory", "4g")
>>> .setMaster("spark://10.0.100.120:7077 <>");
>>> if (restId == 0) {
>>> conf = conf.set("spark.executor.cores", "22");
>>> } else {
>>> conf = conf.set("spark.executor.cores", "2");
>>> }
>>> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>> 
>>> On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <j...@jgp.net 
>>> <mailto:j...@jgp.net>> wrote:
>>> Hi,
>>> 
>>> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
>>

Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
I don't use submit: I start my standalone cluster and connect to it remotely. 
Is that a bad practice?

I'd like to be able to do it dynamically, as the system knows whether it needs more 
or fewer resources based on its own context

> On Jul 15, 2016, at 10:55 AM, Mich Talebzadeh <mich.talebza...@gmail.com> 
> wrote:
> 
> Hi,
> 
> You can also do all this at env or submit time with spark-submit which I 
> believe makes it more flexible than coding in.
> 
> Example
> 
> ${SPARK_HOME}/bin/spark-submit \
> --packages com.databricks:spark-csv_2.11:1.3.0 \
> --driver-memory 2G \
> --num-executors 2 \
> --executor-cores 3 \
> --executor-memory 2G \
> --master spark://50.140.197.217:7077 
> <http://50.140.197.217:7077/> \
> --conf "spark.scheduler.mode=FAIR" \
> --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails 
> -XX:+PrintGCTimeStamps" \
> --jars 
> /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
> --class "${FILE_NAME}" \
> --conf "spark.ui.port=${SP}" \
>  
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 15 July 2016 at 13:48, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Merci Nihed, this is one of the tests I did :( still not working
> 
> 
> 
>> On Jul 15, 2016, at 8:41 AM, nihed mbarek <nihe...@gmail.com 
>> <mailto:nihe...@gmail.com>> wrote:
>> 
>> can you try with : 
>> SparkConf conf = new SparkConf().setAppName("NC Eatery 
>> app").set("spark.executor.memory", "4g")
>>  .setMaster("spark://10.0.100.120:7077 <>");
>>  if (restId == 0) {
>>  conf = conf.set("spark.executor.cores", "22");
>>  } else {
>>  conf = conf.set("spark.executor.cores", "2");
>>  }
>>  JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>> 
>> On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> Hi,
>> 
>> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
>> 
>> My process uses all the cores of my server (good), but I am trying to limit 
>> it so I can actually submit a second job.
>> 
>> I tried
>> 
>>  SparkConf conf = new SparkConf().setAppName("NC Eatery 
>> app").set("spark.executor.memory", "4g")
>>  .setMaster("spark://10.0.100.120:7077 <>");
>>  if (restId == 0) {
>>  conf = conf.set("spark.executor.cores", "22");
>>  } else {
>>  conf = conf.set("spark.executor.cores", "2");
>>  }
>>  JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>> 
>> and
>> 
>>  SparkConf conf = new SparkConf().setAppName("NC Eatery 
>> app").set("spark.executor.memory", "4g")
>>  .setMaster("spark://10.0.100.120:7077 <>");
>>  if (restId == 0) {
>>  conf.set("spark.executor.cores", "22");
>>  } else {
>>  conf.set("spark.executor.cores", "2");
>>  }
>>  JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>> 
>> but it does not seem to take it. Any hint?
>> 
>> jg
>> 
>> 
>> 
>> 
>> 
>> -- 
>> 
>> M'BAREK Med Nihed,
>> Fedora Ambassador, TUNISIA, Northern Africa
>> http://www.nihed.com <http://www.nihed.com/>
>> 
>>  <http://tn.linkedin.com/in/nihed>
>> 
> 
> 



Re: spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Merci Nihed, this is one of the tests I did :( still not working



> On Jul 15, 2016, at 8:41 AM, nihed mbarek <nihe...@gmail.com> wrote:
> 
> can you try with : 
> SparkConf conf = new SparkConf().setAppName("NC Eatery 
> app").set("spark.executor.memory", "4g")
>   .setMaster("spark://10.0.100.120:7077 <>");
>   if (restId == 0) {
>   conf = conf.set("spark.executor.cores", "22");
>   } else {
>   conf = conf.set("spark.executor.cores", "2");
>   }
>       JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
> 
> On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi,
> 
> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
> 
> My process uses all the cores of my server (good), but I am trying to limit 
> it so I can actually submit a second job.
> 
> I tried
> 
>   SparkConf conf = new SparkConf().setAppName("NC Eatery 
> app").set("spark.executor.memory", "4g")
>   .setMaster("spark://10.0.100.120:7077");
>   if (restId == 0) {
>   conf = conf.set("spark.executor.cores", "22");
>   } else {
>   conf = conf.set("spark.executor.cores", "2");
>   }
>   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
> 
> and
> 
>   SparkConf conf = new SparkConf().setAppName("NC Eatery 
> app").set("spark.executor.memory", "4g")
>   .setMaster("spark://10.0.100.120:7077");
>   if (restId == 0) {
>   conf.set("spark.executor.cores", "22");
>   } else {
>   conf.set("spark.executor.cores", "2");
>   }
>   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
> 
> but it does not seem to take it. Any hint?
> 
> jg
> 
> 
> 
> 
> 
> -- 
> 
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com <http://www.nihed.com/>
> 
>  <http://tn.linkedin.com/in/nihed>
> 



spark.executor.cores

2016-07-15 Thread Jean Georges Perrin
Hi,

Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores

My process uses all the cores of my server (good), but I am trying to limit it 
so I can actually submit a second job.

I tried

SparkConf conf = new SparkConf().setAppName("NC Eatery 
app").set("spark.executor.memory", "4g")
.setMaster("spark://10.0.100.120:7077");
if (restId == 0) {
conf = conf.set("spark.executor.cores", "22");
} else {
conf = conf.set("spark.executor.cores", "2");
}
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

and

SparkConf conf = new SparkConf().setAppName("NC Eatery 
app").set("spark.executor.memory", "4g")
.setMaster("spark://10.0.100.120:7077");
if (restId == 0) {
conf.set("spark.executor.cores", "22");
} else {
conf.set("spark.executor.cores", "2");
}
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);

but it does not seem to take it. Any hint?

jg
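
One thing worth trying here, assuming Spark 1.6 on a standalone cluster: spark.cores.max is the
application-wide ceiling on cores, while spark.executor.cores on its own only sizes individual
executors, so the application still grabs every available core. A minimal sketch, reusing the
thread's restId variable:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Cap the whole application, not just each executor.
    SparkConf conf = new SparkConf()
        .setAppName("NC Eatery app")
        .setMaster("spark://10.0.100.120:7077")
        .set("spark.executor.memory", "4g")
        .set("spark.cores.max", restId == 0 ? "22" : "2");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);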




Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
I use it as a standalone cluster.

I run it through start-master, then start-slave. I only have one slave now, but 
I will probably have a few soon.

The "application" is run on a separate box.

When everything was running on my Mac, I was in local mode, but I never set up anything 
specific for local mode. Going "production" was a little more complex than I thought.

> On Jul 13, 2016, at 10:35 PM, Chanh Le <giaosu...@gmail.com> wrote:
> 
> Hi Jean,
> How do you run your Spark Application? Local Mode, Cluster Mode? 
> If you run in local mode did you use —driver-memory and —executor-memory 
> because in local mode your setting about executor and driver didn’t work that 
> you expected.
> 
> 
> 
> 
>> On Jul 14, 2016, at 8:43 AM, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> 
>> Looks like replacing the setExecutorEnv() by set() did the trick... let's 
>> see how fast it'll process my 50x 10ˆ15 data points...
>> 
>>> On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin <j...@jgp.net 
>>> <mailto:j...@jgp.net>> wrote:
>>> 
>>> I have added:
>>> 
>>> SparkConf conf = new 
>>> SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g")
>>> .setMaster("spark://10.0.100.120:7077");
>>> 
>>> but it did not change a thing
>>> 
>>>> On Jul 13, 2016, at 9:14 PM, Jean Georges Perrin <j...@jgp.net 
>>>> <mailto:j...@jgp.net>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I have a Java memory issue with Spark. The same application working on my 
>>>> 8GB Mac crashes on my 72GB Ubuntu server...
>>>> 
>>>> I have changed things in the conf file, but it looks like Spark does not 
>>>> care, so I wonder if my issues are with the driver or executor.
>>>> 
>>>> I set:
>>>> 
>>>> spark.driver.memory 20g
>>>> spark.executor.memory   20g
>>>> And, whatever I do, the crash is always at the same spot in the app, which 
>>>> makes me think that it is a driver problem.
>>>> 
>>>> The exception I get is:
>>>> 
>>>> 16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 
>>>> 208, micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
>>>> at java.nio.HeapCharBuffer.(HeapCharBuffer.java:57)
>>>> at java.nio.CharBuffer.allocate(CharBuffer.java:335)
>>>> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
>>>> at org.apache.hadoop.io.Text.decode(Text.java:412)
>>>> at org.apache.hadoop.io.Text.decode(Text.java:389)
>>>> at org.apache.hadoop.io.Text.toString(Text.java:280)
>>>> at 
>>>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>>>> at 
>>>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>> at 
>>>> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>>>> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>>>> at 
>>>> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>>>> at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>>>> at 
>>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
>>>> at 
>>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
>>>> at 
>>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
>>>> at 
>>>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
>>>> at 
>>>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>>>

Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Looks like replacing setExecutorEnv() with set() did the trick... let's see how fast it'll 
process my 50 x 10^15 data points...
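
For reference, assuming the Spark 1.6 Java API: setExecutorEnv() only sets operating-system
environment variables on the executors, while Spark configuration keys such as
spark.executor.memory have to go through set(). A minimal sketch of the working form:

    import org.apache.spark.SparkConf;

    // spark.executor.memory passed as a Spark property, not as an environment variable.
    SparkConf conf = new SparkConf()
        .setAppName("app")
        .setMaster("spark://10.0.100.120:7077")
        .set("spark.executor.memory", "8g");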

> On Jul 13, 2016, at 9:24 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> I have added:
> 
>   SparkConf conf = new 
> SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g")
>   .setMaster("spark://10.0.100.120:7077");
> 
> but it did not change a thing
> 
>> On Jul 13, 2016, at 9:14 PM, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> 
>> Hi,
>> 
>> I have a Java memory issue with Spark. The same application working on my 
>> 8GB Mac crashes on my 72GB Ubuntu server...
>> 
>> I have changed things in the conf file, but it looks like Spark does not 
>> care, so I wonder if my issues are with the driver or executor.
>> 
>> I set:
>> 
>> spark.driver.memory 20g
>> spark.executor.memory   20g
>> And, whatever I do, the crash is always at the same spot in the app, which 
>> makes me think that it is a driver problem.
>> 
>> The exception I get is:
>> 
>> 16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 208, 
>> micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
>> at java.nio.HeapCharBuffer.(HeapCharBuffer.java:57)
>> at java.nio.CharBuffer.allocate(CharBuffer.java:335)
>> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
>> at org.apache.hadoop.io.Text.decode(Text.java:412)
>> at org.apache.hadoop.io.Text.decode(Text.java:389)
>> at org.apache.hadoop.io.Text.toString(Text.java:280)
>> at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>> at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> at 
>> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>> at 
>> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>> at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 
>> I have set a small memory "dumper" in my app. At the beginning, it says:
>> 
>> **  Free . 1,413,566
>> **  Allocated  1,705,984
>> **  Max .. 16,495,104
>> **> Total free ... 16,202,686
>> Just before the crash, it says:
>> 
>> **  Free . 1,461,633
>> **  Allocated  1,786,880
>> **  Max .. 16,495,104
>> **> Total free ... 16,169,857
>> 
>> 
>> 
>> 
> 



Re: Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
I have added:

SparkConf conf = new 
SparkConf().setAppName("app").setExecutorEnv("spark.executor.memory", "8g")
.setMaster("spark://10.0.100.120:7077");

but it did not change a thing

> On Jul 13, 2016, at 9:14 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Hi,
> 
> I have a Java memory issue with Spark. The same application working on my 8GB 
> Mac crashes on my 72GB Ubuntu server...
> 
> I have changed things in the conf file, but it looks like Spark does not 
> care, so I wonder if my issues are with the driver or executor.
> 
> I set:
> 
> spark.driver.memory 20g
> spark.executor.memory   20g
> And, whatever I do, the crash is always at the same spot in the app, which 
> makes me think that it is a driver problem.
> 
> The exception I get is:
> 
> 16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 208, 
> micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
> at java.nio.HeapCharBuffer.(HeapCharBuffer.java:57)
> at java.nio.CharBuffer.allocate(CharBuffer.java:335)
> at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
> at org.apache.hadoop.io.Text.decode(Text.java:412)
> at org.apache.hadoop.io.Text.decode(Text.java:389)
> at org.apache.hadoop.io.Text.toString(Text.java:280)
> at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
> at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
> at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
> at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
> at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
> at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
> at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
> at 
> org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 
> I have set a small memory "dumper" in my app. At the beginning, it says:
> 
> **  Free . 1,413,566
> **  Allocated  1,705,984
> **  Max .. 16,495,104
> **> Total free ... 16,202,686
> Just before the crash, it says:
> 
> **  Free . 1,461,633
> **  Allocated  1,786,880
> **  Max .. 16,495,104
> **> Total free ... 16,169,857
> 
> 
> 
> 



Memory issue java.lang.OutOfMemoryError: Java heap space

2016-07-13 Thread Jean Georges Perrin
Hi,

I have a Java memory issue with Spark. The same application working on my 8GB 
Mac crashes on my 72GB Ubuntu server...

I have changed things in the conf file, but it looks like Spark does not care, 
so I wonder if my issues are with the driver or executor.

I set:

spark.driver.memory 20g
spark.executor.memory   20g
And, whatever I do, the crash is always at the same spot in the app, which 
makes me think that it is a driver problem.

The exception I get is:

16/07/13 20:36:30 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 208, 
micha.nc.rr.com): java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:335)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:810)
at org.apache.hadoop.io.Text.decode(Text.java:412)
at org.apache.hadoop.io.Text.decode(Text.java:389)
at org.apache.hadoop.io.Text.toString(Text.java:280)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd$1.apply(JSONRelation.scala:105)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1135)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
at 
org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1136)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I have set a small memory "dumper" in my app. At the beginning, it says:

**  Free . 1,413,566
**  Allocated  1,705,984
**  Max .. 16,495,104
**> Total free ... 16,202,686
Just before the crash, it says:

**  Free . 1,461,633
**  Allocated  1,786,880
**  Max .. 16,495,104
**> Total free ... 16,169,857
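
A note on the configuration, assuming the driver is embedded in a plain Java application (no
spark-submit): spark-defaults.conf is only read by the spark-submit/spark-shell launchers, and
spark.driver.memory cannot enlarge a JVM that is already running, so the driver heap stays at
whatever -Xmx the application was started with. A minimal sketch:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Executor memory can still be passed as a normal Spark property here; the driver heap
    // has to be set on the launching JVM itself (e.g. -Xmx20g).
    SparkConf conf = new SparkConf()
        .setAppName("app")
        .setMaster("spark://10.0.100.120:7077")
        .set("spark.executor.memory", "20g");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    System.out.println("Driver max heap (bytes): " + Runtime.getRuntime().maxMemory());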






Re: "client / server" config

2016-07-10 Thread Jean Georges Perrin
Good for the file :)

No, it goes on... as if it were waiting for something

jg


> On Jul 10, 2016, at 22:55, ayan guha <guha.a...@gmail.com> wrote:
> 
> Is this terminating the execution or spark application still runs after this 
> error?
> 
> One thing for sure, it is looking for local file on driver (ie your mac) @ 
> location: file:/Users/jgp/Documents/Data/restaurants-data.json 
> 
>> On Mon, Jul 11, 2016 at 12:33 PM, Jean Georges Perrin <j...@jgp.net> wrote:
>> 
>> I have my dev environment on my Mac. I have a dev Spark server on a freshly 
>> installed physical Ubuntu box.
>> 
>> I had some connection issues, but it is now all fine.
>> 
>> In my code, running on the Mac, I have:
>> 
>>  1   SparkConf conf = new 
>> SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:7077");
>>  2   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>>  3   javaSparkContext.setLogLevel("WARN");
>>  4   SQLContext sqlContext = new SQLContext(javaSparkContext);
>>  5
>>  6   // Restaurant Data
>>  7   df = sqlContext.read().option("dateFormat", 
>> "-mm-dd").json(source.getLocalStorage());
>> 
>> 
>> 1) Clarification question: This code runs on my mac, connects to the server, 
>> but line #7 assumes the file is on my mac, not on the server, right?
>> 
>> 2) On line 7, I get an exception:
>> 
>> 16-07-10 22:20:04:143 DEBUG  - address: jgp-MacBook-Air.local/10.0.100.100 
>> isLoopbackAddress: false, with host 10.0.100.100 jgp-MacBook-Air.local
>> 16-07-10 22:20:04:240 INFO 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation - Listing 
>> file:/Users/jgp/Documents/Data/restaurants-data.json on driver
>> 16-07-10 22:20:04:288 DEBUG org.apache.hadoop.util.Shell - Failed to detect 
>> a valid hadoop home directory
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>>  at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
>>  at org.apache.hadoop.util.Shell.(Shell.java:250)
>>  at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>>  at 
>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
>>  at scala.Option.getOrElse(Option.scala:120)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
>>  at 
>> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
>>  at scala.Option.getOrElse(Option.scala:120)
>> 
>> Do I have to install HADOOP on the server? - I imagine that from:
>> java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
>> 
>> TIA,
>> 
>> jg
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha


"client / server" config

2016-07-10 Thread Jean Georges Perrin

I have my dev environment on my Mac. I have a dev Spark server on a freshly 
installed physical Ubuntu box.

I had some connection issues, but it is now all fine.

In my code, running on the Mac, I have:

1   SparkConf conf = new 
SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:7077");
2   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
3   javaSparkContext.setLogLevel("WARN");
4   SQLContext sqlContext = new SQLContext(javaSparkContext);
5
6   // Restaurant Data
7   df = sqlContext.read().option("dateFormat", 
"-mm-dd").json(source.getLocalStorage());


1) Clarification question: This code runs on my mac, connects to the server, 
but line #7 assumes the file is on my mac, not on the server, right?

2) On line 7, I get an exception:

16-07-10 22:20:04:143 DEBUG  - address: jgp-MacBook-Air.local/10.0.100.100 
isLoopbackAddress: false, with host 10.0.100.100 jgp-MacBook-Air.local
16-07-10 22:20:04:240 INFO 
org.apache.spark.sql.execution.datasources.json.JSONRelation - Listing 
file:/Users/jgp/Documents/Data/restaurants-data.json on driver
16-07-10 22:20:04:288 DEBUG org.apache.hadoop.util.Shell - Failed to detect a 
valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:225)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:250)
at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:76)
at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:447)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation.org$apache$spark$sql$execution$datasources$json$JSONRelation$$createBaseRdd(JSONRelation.scala:98)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4$$anonfun$apply$1.apply(JSONRelation.scala:115)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:115)
at 
org.apache.spark.sql.execution.datasources.json.JSONRelation$$anonfun$4.apply(JSONRelation.scala:109)
at scala.Option.getOrElse(Option.scala:120)

Do I have to install HADOOP on the server? - I imagine that from:
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.

TIA,

jg
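
On question 1: yes, the path is listed on the driver (the Mac) and then read by every executor
on the cluster, so a purely local Mac path stops working once tasks run on the Ubuntu box. The
HADOOP_HOME message is DEBUG-level and generally harmless on Linux/macOS; no Hadoop install is
needed just to read files. A minimal sketch, in which the HDFS URL is only a placeholder for any
location all nodes can see (HDFS, NFS, a copy on shared storage, ...):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    SparkConf conf = new SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:7077");
    JavaSparkContext jsc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(jsc);

    // The file has to live somewhere both the driver and the executors can reach.
    DataFrame df = sqlContext.read()
        .json("hdfs://10.0.100.120:9000/data/restaurants-data.json");  // placeholder path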



Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
It appears that I had issues in my /etc/hosts... it seems OK now

> On Jul 10, 2016, at 2:13 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> I tested that:
> 
> I set:
> 
> _JAVA_OPTIONS=-Djava.net.preferIPv4Stack=true
> SPARK_LOCAL_IP=10.0.100.120
> I still have the warning in the log:
> 
> 16/07/10 14:10:13 WARN Utils: Your hostname, micha resolves to a loopback 
> address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
> 16/07/10 14:10:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> and still connection refused...
> 
> but no luck
> 
>> On Jul 10, 2016, at 1:26 PM, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> 
>> Hi,
>> 
>> So far I have been using Spark "embedded" in my app. Now, I'd like to run it 
>> on a dedicated server.
>> 
>> I am that far:
>> - fresh ubuntu 16, server name is mocha / ip 10.0.100.120, installed scala 
>> 2.10, installed Spark 1.6.2, recompiled
>> - Pi test works
>> - UI on port 8080 works
>> 
>> Log says:
>> Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
>> /opt/apache-spark-1.6.2/conf/:/opt/apache-spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.2.0.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-core-3.2.10.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
>>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip micha --port 7077 
>> --webui-port 8080
>> 
>> Using Spark's default log4j profile: 
>> org/apache/spark/log4j-defaults.properties
>> 16/07/10 13:03:55 INFO Master: Registered signal handlers for [TERM, HUP, 
>> INT]
>> 16/07/10 13:03:55 WARN Utils: Your hostname, micha resolves to a loopback 
>> address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
>> 16/07/10 13:03:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
>> another address
>> 16/07/10 13:03:55 WARN NativeCodeLoader: Unable to load native-hadoop 
>> library for your platform... using builtin-java classes where applicable
>> 16/07/10 13:03:55 INFO SecurityManager: Changing view acls to: root
>> 16/07/10 13:03:55 INFO SecurityManager: Changing modify acls to: root
>> 16/07/10 13:03:55 INFO SecurityManager: SecurityManager: authentication 
>> disabled; ui acls disabled; users with view permissions: Set(root); users 
>> with modify permissions: Set(root)
>> 16/07/10 13:03:56 INFO Utils: Successfully started service 'sparkMaster' on 
>> port 7077.
>> 16/07/10 13:03:56 INFO Master: Starting Spark master at spark://micha:7077 
>> 
>> 16/07/10 13:03:56 INFO Master: Running Spark version 1.6.2
>> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
>> 16/07/10 13:03:56 INFO AbstractConnector: Started 
>> SelectChannelConnector@0.0.0.0 <mailto:SelectChannelConnector@0.0.0.0>:8080
>> 16/07/10 13:03:56 INFO Utils: Successfully started service 'MasterUI' on 
>> port 8080.
>> 16/07/10 13:03:56 INFO MasterWebUI: Started MasterWebUI at 
>> http://10.0.100.120:8080 <http://10.0.100.120:8080/>
>> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
>> 16/07/10 13:03:56 INFO AbstractConnector: Started 
>> SelectChannelConnector@micha:6066
>> 16/07/10 13:03:56 INFO Utils: Successfully started service on port 6066.
>> 16/07/10 13:03:56 INFO StandaloneRestServer: Started REST server for 
>> submitting applications on port 6066
>> 16/07/10 13:03:56 INFO Master: I have been elected leader! New state: ALIVE
>> 
>> 
>> In my app, i changed the config to:
>>  SparkConf conf = new 
>> SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:6066 
>> ");
>> 
>> 
>> (also tried 7077)
>> 
>> 
>> On the client:
>> 16-07-10 13:22:58:300 INFO org.spark-project.jetty.server.AbstractConnector 
>> - Started SelectChannelConnector@0.0.0.0 
>> <mailto:SelectChannelConnector@0.0.0.0>:4040
>> 16-07-10 13:22:58:300 DEBUG 
>> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
>> SelectChannelConnector@0.0.0.0 <mailto:SelectChannelConnector@0.0.0.0>:4040
>> 16-07-10 13:22:58:300 DEBUG 
>> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
>> org.spark-project.jetty.server.Server@3eb292cd
>> 16-07-10 13:22:58:301 INFO org.apache.spark.util.Utils - Successfully 
>> started service 'SparkUI' on port 4040.
>> 16-07-10 13:22:58:306 INFO org.apach

Re: Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
I tested that:

I set:

_JAVA_OPTIONS=-Djava.net.preferIPv4Stack=true
SPARK_LOCAL_IP=10.0.100.120
I still have the warning in the log:

16/07/10 14:10:13 WARN Utils: Your hostname, micha resolves to a loopback 
address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
16/07/10 14:10:13 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
and still connection refused...

but no luck

> On Jul 10, 2016, at 1:26 PM, Jean Georges Perrin <j...@jgp.net> wrote:
> 
> Hi,
> 
> So far I have been using Spark "embedded" in my app. Now, I'd like to run it 
> on a dedicated server.
> 
> I am that far:
> - fresh ubuntu 16, server name is mocha / ip 10.0.100.120, installed scala 
> 2.10, installed Spark 1.6.2, recompiled
> - Pi test works
> - UI on port 8080 works
> 
> Log says:
> Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
> /opt/apache-spark-1.6.2/conf/:/opt/apache-spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.2.0.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-core-3.2.10.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
>  -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip micha --port 7077 
> --webui-port 8080
> 
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 16/07/10 13:03:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
> 16/07/10 13:03:55 WARN Utils: Your hostname, micha resolves to a loopback 
> address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
> 16/07/10 13:03:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 16/07/10 13:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/07/10 13:03:55 INFO SecurityManager: Changing view acls to: root
> 16/07/10 13:03:55 INFO SecurityManager: Changing modify acls to: root
> 16/07/10 13:03:55 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 16/07/10 13:03:56 INFO Utils: Successfully started service 'sparkMaster' on 
> port 7077.
> 16/07/10 13:03:56 INFO Master: Starting Spark master at spark://micha:7077 
> 
> 16/07/10 13:03:56 INFO Master: Running Spark version 1.6.2
> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
> 16/07/10 13:03:56 INFO AbstractConnector: Started 
> SelectChannelConnector@0.0.0.0 <mailto:SelectChannelConnector@0.0.0.0>:8080
> 16/07/10 13:03:56 INFO Utils: Successfully started service 'MasterUI' on port 
> 8080.
> 16/07/10 13:03:56 INFO MasterWebUI: Started MasterWebUI at 
> http://10.0.100.120:8080 <http://10.0.100.120:8080/>
> 16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
> 16/07/10 13:03:56 INFO AbstractConnector: Started 
> SelectChannelConnector@micha:6066
> 16/07/10 13:03:56 INFO Utils: Successfully started service on port 6066.
> 16/07/10 13:03:56 INFO StandaloneRestServer: Started REST server for 
> submitting applications on port 6066
> 16/07/10 13:03:56 INFO Master: I have been elected leader! New state: ALIVE
> 
> 
> In my app, i changed the config to:
>   SparkConf conf = new 
> SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:6066");
> 
> 
> (also tried 7077)
> 
> 
> On the client:
> 16-07-10 13:22:58:300 INFO org.spark-project.jetty.server.AbstractConnector - 
> Started SelectChannelConnector@0.0.0.0 
> <mailto:SelectChannelConnector@0.0.0.0>:4040
> 16-07-10 13:22:58:300 DEBUG 
> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
> SelectChannelConnector@0.0.0.0 <mailto:SelectChannelConnector@0.0.0.0>:4040
> 16-07-10 13:22:58:300 DEBUG 
> org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
> org.spark-project.jetty.server.Server@3eb292cd
> 16-07-10 13:22:58:301 INFO org.apache.spark.util.Utils - Successfully started 
> service 'SparkUI' on port 4040.
> 16-07-10 13:22:58:306 INFO org.apache.spark.ui.SparkUI - Started SparkUI at 
> http://10.0.100.100:4040 <http://10.0.100.100:4040/>
> 16-07-10 13:22:58:621 INFO 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint - Connecting to 
> master spark://10.0.100.120:6066 ...
> 16-07-10 13:22:58:648 DEBUG 
> org.apache.spark.network.client.TransportClientFactory - Creating new 
> connection to /10.0.100.120:6066
> 16-07-10 13:22:58:689 DEBUG io.netty.util.ResourceLeakDetector - 
> -Dio.netty.leakDetectionLevel: simple
> 16-07-10 13:22:58:714 WARN 
> org.apache.spark.deploy.cli

Network issue on deployment

2016-07-10 Thread Jean Georges Perrin
Hi,

So far I have been using Spark "embedded" in my app. Now, I'd like to run it on 
a dedicated server.

Here is how far I am:
- fresh Ubuntu 16, server name is micha / IP 10.0.100.120, installed Scala 2.10, installed 
Spark 1.6.2, recompiled
- Pi test works
- UI on port 8080 works

Log says:
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp 
/opt/apache-spark-1.6.2/conf/:/opt/apache-spark-1.6.2/assembly/target/scala-2.10/spark-assembly-1.6.2-hadoop2.2.0.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-core-3.2.10.jar:/opt/apache-spark-1.6.2/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar
 -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip micha --port 7077 
--webui-port 8080

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/07/10 13:03:55 INFO Master: Registered signal handlers for [TERM, HUP, INT]
16/07/10 13:03:55 WARN Utils: Your hostname, micha resolves to a loopback 
address: 127.0.1.1; using 10.0.100.120 instead (on interface eno1)
16/07/10 13:03:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
16/07/10 13:03:55 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/07/10 13:03:55 INFO SecurityManager: Changing view acls to: root
16/07/10 13:03:55 INFO SecurityManager: Changing modify acls to: root
16/07/10 13:03:55 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
16/07/10 13:03:56 INFO Utils: Successfully started service 'sparkMaster' on 
port 7077.
16/07/10 13:03:56 INFO Master: Starting Spark master at spark://micha:7077
16/07/10 13:03:56 INFO Master: Running Spark version 1.6.2
16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
16/07/10 13:03:56 INFO AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:8080
16/07/10 13:03:56 INFO Utils: Successfully started service 'MasterUI' on port 
8080.
16/07/10 13:03:56 INFO MasterWebUI: Started MasterWebUI at 
http://10.0.100.120:8080
16/07/10 13:03:56 INFO Server: jetty-8.y.z-SNAPSHOT
16/07/10 13:03:56 INFO AbstractConnector: Started 
SelectChannelConnector@micha:6066
16/07/10 13:03:56 INFO Utils: Successfully started service on port 6066.
16/07/10 13:03:56 INFO StandaloneRestServer: Started REST server for submitting 
applications on port 6066
16/07/10 13:03:56 INFO Master: I have been elected leader! New state: ALIVE


In my app, i changed the config to:
SparkConf conf = new 
SparkConf().setAppName("myapp").setMaster("spark://10.0.100.120:6066");


(also tried 7077)


On the client:
16-07-10 13:22:58:300 INFO org.spark-project.jetty.server.AbstractConnector - 
Started SelectChannelConnector@0.0.0.0:4040
16-07-10 13:22:58:300 DEBUG 
org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
SelectChannelConnector@0.0.0.0:4040
16-07-10 13:22:58:300 DEBUG 
org.spark-project.jetty.util.component.AbstractLifeCycle - STARTED 
org.spark-project.jetty.server.Server@3eb292cd
16-07-10 13:22:58:301 INFO org.apache.spark.util.Utils - Successfully started 
service 'SparkUI' on port 4040.
16-07-10 13:22:58:306 INFO org.apache.spark.ui.SparkUI - Started SparkUI at 
http://10.0.100.100:4040
16-07-10 13:22:58:621 INFO 
org.apache.spark.deploy.client.AppClient$ClientEndpoint - Connecting to master 
spark://10.0.100.120:6066...
16-07-10 13:22:58:648 DEBUG 
org.apache.spark.network.client.TransportClientFactory - Creating new 
connection to /10.0.100.120:6066
16-07-10 13:22:58:689 DEBUG io.netty.util.ResourceLeakDetector - 
-Dio.netty.leakDetectionLevel: simple
16-07-10 13:22:58:714 WARN 
org.apache.spark.deploy.client.AppClient$ClientEndpoint - Failed to connect to 
master 10.0.100.120:6066
java.io.IOException: Failed to connect to /10.0.100.120:6066
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)

and if I try to telnet:

$ telnet 10.0.100.120 6066
Trying 10.0.100.120...
telnet: connect to address 10.0.100.120: Connection refused
telnet: Unable to connect to remote host

$ telnet 10.0.100.120 7077
Trying 10.0.100.120...
telnet: connect to address 10.0.100.120: Connection refused
telnet: Unable to connect to remote host

On the server, I checked with netstat:
jgp@micha:/opt/apache-spark$ netstat -a | grep 6066
tcp6   0  0 micha.nc.rr.com:6066[::]:*  LISTEN 
jgp@micha:/opt/apache-spark$ netstat -a | grep 7077
tcp6   0  0 micha.nc.rr.com:7077[::]:*  LISTEN 

If I interpret this correctly, it looks like it listens on IPv6 and not IPv4...

Any clue would be very helpful. I do not think I am far off, but...

Thanks


jg






Re: Processing json document

2016-07-07 Thread Jean Georges Perrin
do you want id1, id2, id3 to be processed similarly?

The Java code I use is:
df = df.withColumn(K.NAME, df.col("fields.premise_name"));

the original structure is something like {"fields":{"premise_name":"ccc"}}

hope it helps

> On Jul 7, 2016, at 1:48 AM, Lan Jiang  wrote:
> 
> Hi, there
> 
> Spark has provided json document processing feature for a long time. In most 
> examples I see, each line is a json object in the sample file. That is the 
> easiest case. But how can we process a json document, which does not conform 
> to this standard format (one line per json object)? Here is the document I am 
> working on. 
> 
> First of all, it is multiple lines for one single big json object. The real 
> file can be as long as 20+ G. Within that one single json object, it contains 
> many name/value pairs. The name is some kind of id values. The value is the 
> actual json object that I would like to be part of dataframe. Is there any 
> way to do that? Appreciate any input. 
> 
> 
> {
> "id1": {
> "Title":"title1",
> "Author":"Tom",
> "Source":{
> "Date":"20160506",
> "Type":"URL"
> },
> "Data":" blah blah"},
> 
> "id2": {
> "Title":"title2",
> "Author":"John",
> "Source":{
> "Date":"20150923",
> "Type":"URL"
> },
> "Data":"  blah blah "},
> 
> "id3: {
> "Title":"title3",
> "Author":"John",
> "Source":{
> "Date":"20150902",
> "Type":"URL"
> },
> "Data":" blah blah "}
> }
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
Right now, I am having "fun" with Spark and 26446249960843350 datapoints on my 
MacBook Air, but my small friend is suffering... 

From my experience:
You will be able to do the job with Spark. You can try to load everything on a dev machine; no 
need for a server, a workstation might be enough.
I would not recommend VMs when you go to production, unless you already have them. Bare metal 
seems more suitable.

It's definitely worth a shot!
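
A rough sketch of the shape such a job can take on Spark 1.6; the JDBC URL, file name, table
and column names below are placeholders, not taken from the thread. The order file and the
reference tables become DataFrames, and the ~15 processing steps turn into joins and column
transformations:

    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class BatchJoinSketch {
        public static DataFrame run(SQLContext sqlContext, String db2Url) {
            DataFrame orders = sqlContext.read()
                .format("com.databricks.spark.csv")   // spark-csv package
                .option("header", "true")
                .load("/data/book_orders.csv");       // placeholder path

            DataFrame books = sqlContext.read()
                .format("jdbc")
                .option("url", db2Url)                // e.g. jdbc:db2://host:port/DB
                .option("dbtable", "BOOKS")           // placeholder table
                .load();

            // One of the lookup steps: enrich each order with its book row; the other
            // reference tables follow the same join pattern.
            return orders.join(books, orders.col("BOOK_ID").equalTo(books.col("ID")));
        }
    }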

> On Jul 6, 2016, at 3:39 PM, Andreas Bauer <dabuks...@gmail.com> wrote:
> 
> The sql statements are embedded in a PL/1 program using DB2 running ob z/OS. 
> Quite powerful, but expensive and foremost shared withother jobs in the 
> comapny. The whole job takes approx. 20 minutes. 
> 
> So I was thinking to use Spark and let the Spark job run on 10 or 20 virtual 
> instances, which I can spawn easily, on-demand and almost for free using a 
> cloud infrastructure. 
> 
> 
> 
> 
> On 6. Juli 2016 um 21:29:53 MESZ, Jean Georges Perrin <j...@jgp.net> wrote:
>> What are you doing it on right now?
>> 
>> > On Jul 6, 2016, at 3:25 PM, dabuki wrote:
>> > 
>> > I was thinking about to replace a legacy batch job with Spark, but I'm not
>> > sure if Spark is suited for this use case. Before I start the proof of
>> > concept, I wanted to ask for opinions.
>> > 
>> > The legacy job works as follows: A file (100k - 1 mio entries) is iterated.
>> > Every row contains a (book) order with an id and for each row approx. 15
>> > processing steps have to be performed that involve access to multiple
>> > database tables. In total approx. 25 tables (each containing 10k-700k
>> > entries) have to be scanned using the book's id and the retrieved data is
>> > joined together. 
>> > 
>> > As I'm new to Spark I'm not sure if I can leverage Spark's processing model
>> > for this use case.
>> > 
>> > 
>> > 
>> > 
>> > 
>> > --
>> > View this message in context: 
>> > http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> > 
>> > -
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> > 
>> 



Re: Is Spark suited for replacing a batch job using many database tables?

2016-07-06 Thread Jean Georges Perrin
What are you running it on right now?

> On Jul 6, 2016, at 3:25 PM, dabuki  wrote:
> 
> I was thinking about to replace a legacy batch job with Spark, but I'm not
> sure if Spark is suited for this use case. Before I start the proof of
> concept, I wanted to ask for opinions.
> 
> The legacy job works as follows: A file (100k - 1 mio entries) is iterated.
> Every row contains a (book) order with an id and for each row approx. 15
> processing steps have to be performed that involve access to multiple
> database tables. In total approx. 25 tables (each containing 10k-700k
> entries) have to be scanned using the book's id and the retrieved data is
> joined together. 
> 
> As I'm new to Spark I'm not sure if I can leverage Spark's processing model
> for this use case.
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-Spark-suited-for-replacing-a-batch-job-using-many-database-tables-tp27300.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: log traces

2016-07-04 Thread Jean Georges Perrin
Hi Luis,

Right...

I manage all my Spark "things" through Maven; by that I mean I have a pom.xml with all the 
dependencies in it. Here it is:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>app</groupId>
    <artifactId>main</artifactId>
    <version>0.0.1</version>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.3</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.6</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-core</artifactId>
            <version>5.2.0.Final</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.6.2</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.10</artifactId>
            <version>1.6.2</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.10</artifactId>
            <version>1.4.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>3.4</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>2.9.4</version>
        </dependency>
    </dependencies>
</project>

When I run the application, I run it not through Maven but through Eclipse, as a run 
configuration.

At no point do I see or set SPARK_HOME. I tried setting it programmatically as well, and Spark 
does not pick it up.

I do not connect to a Spark cluster (yet); everything runs just on my machine...

I hope it is clear, just started spark'ing recently...

jg




> On Jul 4, 2016, at 6:28 PM, Luis Mateos <luismat...@gmail.com> wrote:
> 
> Hi Jean, 
> 
> What do you mean by "running everything through maven"? Usually, applications 
> are compiled using maven and then launched by using the 
> $SPARK_HOME/bin/spark-submit script. It might be helpful to provide us more 
> details on how you are running your application.
> 
> Regards,
> Luis
> 
> On 4 July 2016 at 16:57, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hey Anupam,
> 
> Thanks... but no:
> 
> I tried:
> 
>   SparkConf conf = new SparkConf().setAppName("my 
> app").setMaster("local");
>   JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>   javaSparkContext.setLogLevel("WARN");
>   SQLContext sqlContext = new SQLContext(javaSparkContext);
> 
> and
> 
>   SparkConf conf = new SparkConf().setAppName("my 
> app").setMaster("local");
>   SparkContext sc = new SparkContext(conf);
>   sc.setLogLevel("WARN");
>   SQLContext sqlContext = new SQLContext(sc);
> 
> and they are still very upset at my console :)...
> 
> 
>> On Jul 4, 2016, at 5:28 PM, Anupam Bhatnagar <anupambhatna...@gmail.com 
>> <mailto:anupambhatna...@gmail.com>> wrote:
>> 
>> Hi Jean,
>> 
>> How about using sc.setLogLevel("WARN") ? You may add this statement after 
>> initializing the Spark Context. 
>> 
>> From the Spark API - "Valid log levels include: ALL, DEBUG, ERROR, FATAL, 
>> INFO, OFF, TRACE, WARN". Here's the link in the Spark API. 
>> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
>>  
>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext>
>> 
>> Hope this helps,
>> Anupam
>>   
>> 
>> 
>> 
>> On Mon, Jul 4, 2016 at 2:18 PM, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> Thanks Mich, but what is SPARK_HOME when you run everything through Maven?
>> 
>>> On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> check %SPARK_HOME/conf
>>> 
>>> copy file log4j.properties.template to log4j.properties
>>> 
>>> edit log4j.properties and set the log levels to your needs
>>> 
>>> cat log4j.properties
>>> 
>>>

Re: log traces

2016-07-04 Thread Jean Georges Perrin
Hey Anupam,

Thanks... but no:

I tried:

SparkConf conf = new SparkConf().setAppName("my 
app").setMaster("local");
JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
javaSparkContext.setLogLevel("WARN");
SQLContext sqlContext = new SQLContext(javaSparkContext);

and

SparkConf conf = new SparkConf().setAppName("my 
app").setMaster("local");
SparkContext sc = new SparkContext(conf);
sc.setLogLevel("WARN");
SQLContext sqlContext = new SQLContext(sc);

and they are still very upset at my console :)...


> On Jul 4, 2016, at 5:28 PM, Anupam Bhatnagar <anupambhatna...@gmail.com> 
> wrote:
> 
> Hi Jean,
> 
> How about using sc.setLogLevel("WARN") ? You may add this statement after 
> initializing the Spark Context. 
> 
> From the Spark API - "Valid log levels include: ALL, DEBUG, ERROR, FATAL, 
> INFO, OFF, TRACE, WARN". Here's the link in the Spark API. 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
>  
> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext>
> 
> Hope this helps,
> Anupam
>   
> 
> 
> 
> On Mon, Jul 4, 2016 at 2:18 PM, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Thanks Mich, but what is SPARK_HOME when you run everything through Maven?
> 
>> On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> check %SPARK_HOME/conf
>> 
>> copy file log4j.properties.template to log4j.properties
>> 
>> edit log4j.properties and set the log levels to your needs
>> 
>> cat log4j.properties
>> 
>> # Set everything to be logged to the console
>> log4j.rootCategory=ERROR, console
>> log4j.appender.console=org.apache.log4j.ConsoleAppender
>> log4j.appender.console.target=System.err
>> log4j.appender.console.layout=org.apache.log4j.PatternLayout
>> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
>> %c{1}: %m%n
>> # Settings to quiet third party logs that are too verbose
>> log4j.logger.org.spark-project.jetty=WARN
>> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
>> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
>> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
>> log4j.logger.org.apache.parquet=ERROR
>> log4j.logger.parquet=ERROR
>> # SPARK-9183: Settings to avoid annoying messages when looking up 
>> nonexistent UDFs in SparkSQL with Hive support
>> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
>> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
>> 
>> HTH
>> 
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> On 4 July 2016 at 21:56, Jean Georges Perrin <j...@jgp.net 
>> <mailto:j...@jgp.net>> wrote:
>> Hi,
>> 
>> I have installed Apache Spark via Maven.
>> 
>> How can I control the volume of log it displays on my system? I tried 
>> different location for a log4j.properties, but none seems to work for me.
>> 
>> Thanks for help...
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
>> <mailto:user-unsubscr...@spark.apache.org>
>> 
>> 
> 
> 



Re: log traces

2016-07-04 Thread Jean Georges Perrin
Thanks Mich, but what is SPARK_HOME when you run everything through Maven?

> On Jul 4, 2016, at 5:12 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> check %SPARK_HOME/conf
> 
> copy file log4j.properties.template to log4j.properties
> 
> edit log4j.properties and set the log levels to your needs
> 
> cat log4j.properties
> 
> # Set everything to be logged to the console
> log4j.rootCategory=ERROR, console
> log4j.appender.console=org.apache.log4j.ConsoleAppender
> log4j.appender.console.target=System.err
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p 
> %c{1}: %m%n
> # Settings to quiet third party logs that are too verbose
> log4j.logger.org.spark-project.jetty=WARN
> log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
> log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
> log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
> log4j.logger.org.apache.parquet=ERROR
> log4j.logger.parquet=ERROR
> # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent 
> UDFs in SparkSQL with Hive support
> log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
> log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
> On 4 July 2016 at 21:56, Jean Georges Perrin <j...@jgp.net 
> <mailto:j...@jgp.net>> wrote:
> Hi,
> 
> I have installed Apache Spark via Maven.
> 
> How can I control the volume of log it displays on my system? I tried 
> different location for a log4j.properties, but none seems to work for me.
> 
> Thanks for help...
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org 
> <mailto:user-unsubscr...@spark.apache.org>
> 
> 



log traces

2016-07-04 Thread Jean Georges Perrin
Hi,

I have installed Apache Spark via Maven.

How can I control the volume of logs it displays on my system? I tried different locations for 
a log4j.properties, but none seems to work for me.

Thanks for help...
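
When the application is launched straight from Maven or Eclipse rather than spark-submit, there
is no SPARK_HOME/conf on the classpath, so either put a log4j.properties in src/main/resources
or quiet the noisy loggers programmatically before creating the SparkContext (Spark 1.6 bundles
log4j 1.x). A minimal sketch of the programmatic route:

    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    // Lower the chatty loggers before the SparkContext is created.
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN);
    Logger.getLogger("org.spark-project.jetty").setLevel(Level.ERROR);
    Logger.getLogger("akka").setLevel(Level.WARN);
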
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org