Re: Reading TB of JSON file

2020-06-18 Thread nihed mbarek
Hi,

What is the size of a single JSON document?

There is also the scan of your JSON to infer the schema; that overhead can
be huge. Two solutions: define a schema and pass it directly during the
load, or ask Spark to analyse only a small part of the JSON file (I don't
remember the exact option off-hand).
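
For reference, a minimal sketch of both options with the DataFrameReader API
(Spark 2.x+ assumed; the schema fields, path and sampling ratio below are
placeholders):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    SparkSession spark = SparkSession.builder().getOrCreate();

    // Option 1: provide the schema up front, so Spark never scans the file to infer it.
    StructType schema = new StructType()
            .add("id", DataTypes.LongType)
            .add("payload", DataTypes.StringType);
    Dataset<Row> withSchema = spark.read().schema(schema).json("hdfs:///data/big.json");

    // Option 2: let Spark infer the schema from a small sample of the input.
    Dataset<Row> sampled = spark.read()
            .option("samplingRatio", "0.01")
            .json("hdfs:///data/big.json");

With a predefined schema the driver no longer has to merge inferred schemas
from the whole file, which is often what blows up its memory.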

Regards,


On Thu, Jun 18, 2020 at 3:12 PM Chetan Khatri 
wrote:

> Hi Spark Users,
>
> I have a 50 GB JSON file that I would like to read and persist to HDFS so it
> can be taken into the next transformation. I am trying to read it as
> spark.read.json(path), but this gives an out-of-memory error on the driver.
> Obviously, I can't afford having 50 GB of driver memory. In general, what
> is the best practice for reading a large JSON file like 50 GB?
>
> Thanks
>


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Spark 2.4.4 with Hadoop 3.2.0

2019-11-25 Thread nihed mbarek
Hi,
Spark 2.x is already part of Cloudera CDH 6, which is based on Hadoop 3.x,
so Spark 2 + Hadoop 3 is officially supported there; tests and development
have certainly been done on that side. I don't know the status for Hadoop
3.2 specifically, though.

Regards,

On Tue, Nov 26, 2019 at 1:46 AM Alfredo Marquez 
wrote:

> Thank you Ismael! That's what I was looking for. I can take this to our
> platform team.
>
> Alfredo
>
> On Mon, Nov 25, 2019, 3:32 PM Ismaël Mejía  wrote:
>
>> Not officially. Apache Spark only announced support for Hadoop 3.x
>> starting with the upcoming Spark 3.
>> There is a preview release of Spark 3 with support for Hadoop 3.2 that
>> you can try now:
>>
>> https://archive.apache.org/dist/spark/spark-3.0.0-preview/spark-3.0.0-preview-bin-hadoop3.2.tgz
>>
>> Enjoy!
>>
>>
>>
>> On Tue, Nov 19, 2019 at 3:44 PM Alfredo Marquez <
>> alfredo.g.marq...@gmail.com> wrote:
>>
>>> I also would like know the answer to this question.
>>>
>>> Thanks,
>>>
>>> Alfredo
>>>
>>> On Tue, Nov 19, 2019, 8:24 AM bsikander  wrote:
>>>
 Hi,
 Are Spark 2.4.4 and Hadoop 3.2.0 compatible?
 I tried to search the mailing list but couldn't find anything relevant.





 --
 Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

 -
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Use SQL Script to Write Spark SQL Jobs

2017-06-14 Thread nihed mbarek
Hi

I have already seen a project with the same idea:
https://github.com/cloudera-labs/envelope
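
For context, a minimal sketch of the kind of boilerplate such a SQL-script
layer boils down to on the Spark side (run a statement, register the result
for the next one). This assumes the Spark 2.x Java API; runStep and the view
names are illustrative only, not the UberScriptQuery API:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    static Dataset<Row> runStep(SparkSession spark, String viewName, String sql) {
        Dataset<Row> result = spark.sql(sql);      // execute one SQL statement
        result.createOrReplaceTempView(viewName);  // make it visible to the next statement
        return result;
    }

    // usage: each statement of the script feeds the next one through a temp view
    // runStep(spark, "cleaned", "SELECT * FROM raw WHERE price > 0");
    // runStep(spark, "daily",   "SELECT day, sum(price) AS total FROM cleaned GROUP BY day");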

Regards,

On Wed, 14 Jun 2017 at 04:32, bo yang  wrote:

> Thanks Benjamin and Ayan for the feedback! You represent the two groups of
> people who do and do not need such a script tool. Personally, I find the
> script very useful for writing ETL pipelines and daily jobs. Let's see
> whether there are other people interested in such a project.
>
> Best,
> Bo
>
>
>
>
>
> On Mon, Jun 12, 2017 at 11:26 PM, ayan guha  wrote:
>
>> Hi
>>
>> IMHO, this approach is not very useful.
>>
>> Firstly, on the two use cases mentioned on the project page:
>>
>> 1. Simplify Spark development - I think the only thing that can be done
>> there is to come up with some boilerplate function which essentially takes a
>> SQL statement and comes back with a temp table name and a corresponding DF
>> (remember the project targets structured data sources only, not streaming or
>> RDD). Building another mini-DSL on top of an already fairly elaborate Spark
>> API never appealed to me.
>>
>> 2. Business Analysts using Spark - single word answer is Notebooks. Take
>> your pick - Jupyter, Zeppelin, Hue.
>>
>> The case of "Spark is for Developers", IMHO, stems from the
>> packaging/building overhead of Spark apps. For Python users, this barrier
>> is considerably lower (and maybe that is why I do not see a prominent
>> need).
>>
>> But I can imagine the pain of a SQL developer coming into a Scala/Java
>> world. I came from a hardcore SQL/DWH environment where I used to write SQL
>> and SQL only, so SBT and MVN are still not my friends. Maybe someday they
>> will be. But I learned them the hard way, because the value of using Spark
>> offsets the pain by a long way. So I think one needs to spend time with the
>> environment to get comfortable with it. And maybe, just maybe, use NiFi in
>> case you miss drag-and-drop features too much :)
>>
>> But, these are my 2c, and sincerely humble opinion, and I wish you all
>> the luck for your project.
>>
>> On Tue, Jun 13, 2017 at 3:23 PM, Benjamin Kim  wrote:
>>
>>> Hi Bo,
>>>
>>> +1 for your project. I come from the world of data warehouses, ETL, and
>>> reporting analytics. There are many individuals who do not know or want to
>>> do any coding. They are content with ANSI SQL and stick to it. ETL
>>> workflows are also done without any coding using a drag-and-drop user
>>> interface, such as Talend, SSIS, etc. There is a small amount of scripting
>>> involved but not too much. I looked at what you are trying to do, and I
>>> welcome it. This could open up Spark to the masses and shorten development
>>> times.
>>>
>>> Cheers,
>>> Ben
>>>
>>>
>>> On Jun 12, 2017, at 10:14 PM, bo yang  wrote:
>>>
>>> Hi Aakash,
>>>
>>> Thanks for your willingness to help :) It will be great if I can get more
>>> feedback on my project. For example, are there other people who feel the
>>> need for a script to write Spark jobs easily? Also, I would like to explore
>>> whether the Spark project might take on some work to build such a
>>> script-based high-level DSL.
>>>
>>> Best,
>>> Bo
>>>
>>>
>>> On Mon, Jun 12, 2017 at 12:14 PM, Aakash Basu <
>>> aakash.spark@gmail.com> wrote:
>>>
 Hey,

 I work on Spark SQL and would pretty much be able to help you in this.
 Let me know your requirement.

 Thanks,
 Aakash.

 On 12-Jun-2017 11:00 AM, "bo yang"  wrote:

> Hi Guys,
>
> I am writing a small open source project
>  to use SQL Script to write Spark Jobs. I want to see if there are other
> people interested in using or contributing to this project.
>
> The project is called UberScriptQuery (
> https://github.com/uber/uberscriptquery). Sorry for the dumb name, chosen to
> avoid conflict with many other names (Spark is a registered trademark, thus I
> could not use Spark in my project name).
>
> In short, it is a high level SQL-like DSL (Domain Specific Language)
> on top of Spark. People can use that DSL to write Spark jobs without
> worrying about Spark internal details. Please check README
>  in the project to get more
> details.
>
> It will be great if I could get any feedback or suggestions!
>
> Best,
> Bo
>
>
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
> --

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread nihed mbarek
And if we have this static method:

    df.show();
    Column c = concatFunction(df, "l1", "firstname,lastname");
    df.select(c).show();

with this code:

    // assumes: import org.apache.spark.sql.Column;
    //          import org.apache.spark.sql.DataFrame;
    //          import org.apache.spark.sql.functions;
    static Column concatFunction(DataFrame df, String fieldName, String columns) {
        String[] array = columns.split(",");
        Column[] concatColumns = new Column[array.length];
        for (int i = 0; i < concatColumns.length; i++) {
            // resolve each requested column name against the DataFrame
            concatColumns[i] = df.col(array[i]);
        }
        return functions.concat(concatColumns).alias(fieldName);
    }
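
Applied to the case in the question below, the helper would be used like this
(a sketch; orgdf and the column names come from Abhishek's example):

    // build the concatenation of C0, C1 and C2 from a single comma-separated string
    DataFrame training = orgdf.withColumn("I1",
            concatFunction(orgdf, "I1", "C0,C1,C2"));
    training.show();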



On Mon, Jul 18, 2016 at 2:14 PM, Abhishek Anand 
wrote:

> Hi Nihed,
>
> Thanks for the reply.
>
> I am looking for something like this :
>
> DataFrame training = orgdf.withColumn("I1",
> functions.concat(orgdf.col("C0"),orgdf.col("C1")));
>
>
> Here I have to give C0 and C1 columns, I am looking to write a generic
> function that concatenates the columns depending on input columns.
>
> like if I have something
> String str = "C0,C1,C2"
>
> Then it should work as
>
> DataFrame training = orgdf.withColumn("I1",
> functions.concat(orgdf.col("C0"),orgdf.col("C1"),orgdf.col("C2")));
>
>
>
> Thanks,
> Abhi
>
> On Mon, Jul 18, 2016 at 4:39 PM, nihed mbarek  wrote:
>
>> Hi,
>>
>>
>> I just wrote this code to help you. Is it what you need ??
>>
>>
>> SparkConf conf = new
>> SparkConf().setAppName("hello").setMaster("local");
>> JavaSparkContext sc = new JavaSparkContext(conf);
>> SQLContext sqlContext = new SQLContext(sc);
>> List<Person> persons = new ArrayList<>();
>> persons.add(new Person("nihed", "mbarek", "nihed.com"));
>> persons.add(new Person("mark", "zuckerberg", "facebook.com"));
>>
>> DataFrame df = sqlContext.createDataFrame(persons, Person.class);
>>
>> df.show();
>> final String[] columns = df.columns();
>> Column[] selectColumns = new Column[columns.length + 1];
>> for (int i = 0; i < columns.length; i++) {
>> selectColumns[i]=df.col(columns[i]);
>> }
>>
>>
>> selectColumns[columns.length]=functions.concat(df.col("firstname"),
>> df.col("lastname"));
>>
>> df.select(selectColumns).show();
>>   ---
>> public static class Person {
>>
>> private String firstname;
>> private String lastname;
>> private String address;
>> }
>>
>>
>>
>> Regards,
>>
>> On Mon, Jul 18, 2016 at 12:45 PM, Abhishek Anand wrote:
>>
>>> Hi,
>>>
>>> I have a dataframe say having C0,C1,C2 and so on as columns.
>>>
>>> I need to create interaction variables to be taken as input for my
>>> program.
>>>
>>> For eg -
>>>
>>> I need to create I1 as concatenation of C0,C3,C5
>>>
>>> Similarly, I2  = concat(C4,C5)
>>>
>>> and so on ..
>>>
>>>
>>> How can I achieve this in my Java code for concatenation of any columns
>>> given input by the user.
>>>
>>> Thanks,
>>> Abhi
>>>
>>
>>
>>
>> --
>>
>> M'BAREK Med Nihed,
>> Fedora Ambassador, TUNISIA, Northern Africa
>> http://www.nihed.com
>>
>> <http://tn.linkedin.com/in/nihed>
>>
>>
>


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com

<http://tn.linkedin.com/in/nihed>


Re: Concatenate the columns in dataframe to create new collumns using Java

2016-07-18 Thread nihed mbarek
Hi,


I just wrote this code to help you. Is it what you need?

    SparkConf conf = new SparkConf().setAppName("hello").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);
    SQLContext sqlContext = new SQLContext(sc);

    List<Person> persons = new ArrayList<>();
    persons.add(new Person("nihed", "mbarek", "nihed.com"));
    persons.add(new Person("mark", "zuckerberg", "facebook.com"));

    DataFrame df = sqlContext.createDataFrame(persons, Person.class);
    df.show();

    // select every existing column plus one extra concatenated column
    final String[] columns = df.columns();
    Column[] selectColumns = new Column[columns.length + 1];
    for (int i = 0; i < columns.length; i++) {
        selectColumns[i] = df.col(columns[i]);
    }
    selectColumns[columns.length] = functions.concat(df.col("firstname"),
            df.col("lastname"));

    df.select(selectColumns).show();
  ---
    // bean used by createDataFrame; it also needs constructors and getters
    public static class Person {
        private String firstname;
        private String lastname;
        private String address;
    }



Regards,

On Mon, Jul 18, 2016 at 12:45 PM, Abhishek Anand 
wrote:

> Hi,
>
> I have a dataframe say having C0,C1,C2 and so on as columns.
>
> I need to create interaction variables to be taken as input for my
> program.
>
> For eg -
>
> I need to create I1 as concatenation of C0,C3,C5
>
> Similarly, I2  = concat(C4,C5)
>
> and so on ..
>
>
> How can I achieve this in my Java code for concatenation of any columns
> given input by the user.
>
> Thanks,
> Abhi
>



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com

<http://tn.linkedin.com/in/nihed>


Re: spark.executor.cores

2016-07-15 Thread nihed mbarek
Can you try with:

    SparkConf conf = new SparkConf().setAppName("NC Eatery app")
            .set("spark.executor.memory", "4g")
            .setMaster("spark://10.0.100.120:7077");
    if (restId == 0) {
        conf = conf.set("spark.executor.cores", "22");
    } else {
        conf = conf.set("spark.executor.cores", "2");
    }
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
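
If the goal is to stop one application from taking all 24 cores of the
standalone cluster, spark.cores.max is usually the setting that caps the
total cores an application may claim (spark.executor.cores only sizes each
executor). A sketch under that assumption, reusing the variables from the
question:

    SparkConf conf = new SparkConf().setAppName("NC Eatery app")
            .set("spark.executor.memory", "4g")
            .set("spark.cores.max", restId == 0 ? "22" : "2")  // cap for the whole app
            .setMaster("spark://10.0.100.120:7077");
    JavaSparkContext javaSparkContext = new JavaSparkContext(conf);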

On Fri, Jul 15, 2016 at 2:31 PM, Jean Georges Perrin  wrote:

> Hi,
>
> Configuration: standalone cluster, Java, Spark 1.6.2, 24 cores
>
> My process uses all the cores of my server (good), but I am trying to
> limit it so I can actually submit a second job.
>
> I tried
>
> SparkConf conf = new SparkConf().setAppName("NC Eatery app").set(
> "spark.executor.memory", "4g")
> .setMaster("spark://10.0.100.120:7077");
> if (restId == 0) {
> conf = conf.set("spark.executor.cores", "22");
> } else {
> conf = conf.set("spark.executor.cores", "2");
> }
> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>
> and
>
> SparkConf conf = new SparkConf().setAppName("NC Eatery app").set(
> "spark.executor.memory", "4g")
> .setMaster("spark://10.0.100.120:7077");
> if (restId == 0) {
> conf.set("spark.executor.cores", "22");
> } else {
> conf.set("spark.executor.cores", "2");
> }
> JavaSparkContext javaSparkContext = new JavaSparkContext(conf);
>
> but it does not seem to take it. Any hint?
>
> jg
>
>
>


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: remove row from data frame

2016-07-05 Thread nihed mbarek
Hi,

You can apply one or more filter conditions to keep only the rows you need;
everything that does not match is dropped from the DataFrame.
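
For example (a sketch, assuming the Spark 1.6 Java API and a DataFrame df
with hypothetical "age" and "country" columns):

    // keep only adults from France; chain or combine conditions as needed
    DataFrame adultsFr = df.filter(
            df.col("age").geq(18).and(df.col("country").equalTo("FR")));
    adultsFr.show();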

regards,

On Tue, Jul 5, 2016 at 5:38 PM, pseudo oduesp  wrote:

> Hi ,
> how i can remove row from data frame  verifying some condition on some
> columns ?
> thanks
>



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Read Kafka topic in a Spark batch job

2016-07-05 Thread nihed mbarek
Hi,

Are you using a recent version of Kafka? If yes: since 0.9, the
auto.offset.reset parameter takes one of:

   - earliest: automatically reset the offset to the earliest offset
   - latest: automatically reset the offset to the latest offset
   - none: throw an exception to the consumer if no previous offset is found
   for the consumer's group
   - anything else: throw an exception to the consumer.

https://kafka.apache.org/documentation.html
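
For the batch part of the question below, a minimal sketch of reading a fixed
offset range with KafkaUtils#createRDD (assuming Spark 1.6 with the Kafka 0.8
direct API; the broker address, topic and offsets are placeholders):

    import java.util.HashMap;
    import java.util.Map;

    import kafka.serializer.StringDecoder;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.streaming.kafka.KafkaUtils;
    import org.apache.spark.streaming.kafka.OffsetRange;

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");

    // read partition 0 of "my-topic" from offset 0 (inclusive) up to 1000 (exclusive)
    OffsetRange[] offsetRanges = { OffsetRange.create("my-topic", 0, 0L, 1000L) };

    JavaPairRDD<String, String> rdd = KafkaUtils.createRDD(
            jsc,                                   // an existing JavaSparkContext
            String.class, String.class,
            StringDecoder.class, StringDecoder.class,
            kafkaParams,
            offsetRanges);

Persisting the last untilOffset somewhere (ZooKeeper, HDFS, a small table)
between runs remains the job's own responsibility, as the original question
notes.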


Regards,

On Tue, Jul 5, 2016 at 2:15 PM, Bruckwald Tamás  wrote:

> Hello,
>
> I'm writing a Spark (v1.6.0) batch job which reads from a Kafka topic.
> For this I can use org.apache.spark.streaming.kafka.KafkaUtils#createRDD
> however, I need to set the offsets for all the partitions and also need to
> store them somewhere (ZK? HDFS?) to know from where to start the next batch
> job.
> What is the right approach to read from Kafka in a batch job?
>
> I'm also thinking about writing a streaming job instead, which reads from
> auto.offset.reset=smallest and saves the checkpoint to HDFS and then in the
> next run it starts from that.
> But in this case how can I just fetch once and stop streaming after the
> first batch?
>
> I posted this question on StackOverflow recently (
> http://stackoverflow.com/q/38026627/4020050) but got no answer there, so
> I'd ask here as well, hoping that I get some ideas on how to resolve this
> issue.
>
> Thanks - Bruckwald
>



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: removing header from csv file

2016-04-26 Thread nihed mbarek
You can add a filter on a string that you are sure appears only in the
header line.
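
A sketch without SQLContext, using plain RDDs (Spark 1.6 Java API assumed;
the path is a placeholder):

    JavaRDD<String> lines = sc.textFile("hdfs:///data/input.csv");
    final String header = lines.first();                  // first line = header
    JavaRDD<String> data = lines.filter(line -> !line.equals(header));

Note this drops every line equal to the header, which is usually exactly the
string that appears only in the header row.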

On Wednesday, 27 April 2016, Divya Gehlot wrote:

> yes, you can remove the header by removing the first row
>
> you can use first() or head() to do that
>
>
> Thanks,
> Divya
>
> On 27 April 2016 at 13:24, Ashutosh Kumar wrote:
>
>> I see there is a library spark-csv which can be used for removing header
>> and processing of csv files. But it seems it works with sqlcontext only. Is
>> there a way to remove header from csv files without sqlcontext ?
>>
>> Thanks
>> Ashutosh
>>
>
>

-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Best practices repartition key

2016-04-22 Thread nihed mbarek
Hi,
I'm looking for documentation or best practices on choosing a key (or keys)
when repartitioning a DataFrame or RDD.

Thank you
MBAREK nihed

-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Spark 1.6.1 already maximum pages

2016-04-21 Thread nihed mbarek
Hi

I just hit an issue with my job on Spark 1.6.1.
I'm trying to join two DataFrames: one with 5 partitions and a second,
small one, with 2 partitions.
spark.sql.shuffle.partitions is set to 256000.

Any idea ??

java.lang.IllegalStateException: Have already allocated a maximum of 8192
pages
at
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:259)
at
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
at
org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
at
org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
at
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
at
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: prefix column Spark

2016-04-19 Thread nihed mbarek
Hi,
Thank you. That is essentially my first solution (the withColumnRenamed
loop), and it took a long time to go through all my fields.
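
A third option that avoids both the per-column withColumnRenamed loop and the
round trip through an RDD is a single select with aliased columns (a sketch,
assuming the Spark 1.6 Java API and a hypothetical prefix "pre"):

    Column[] renamed = new Column[df.columns().length];
    int i = 0;
    for (String name : df.columns()) {
        renamed[i++] = df.col(name).alias("pre_" + name);  // rename via alias
    }
    DataFrame prefixed = df.select(renamed);               // one projection

Because everything happens in one projection, this should keep the plan small
even with 800 columns.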

Regards,

On Tue, Apr 19, 2016 at 11:29 AM, Ndjido Ardo BAR  wrote:

>
> This can help:
>
> import org.apache.spark.sql.DataFrame
>
> def prefixDf(dataFrame: DataFrame, prefix: String): DataFrame = {
>   val colNames = dataFrame.columns
>   colNames.foldLeft(dataFrame){
> (df, colName) => {
>   df.withColumnRenamed(colName, s"${prefix}_${colName}")
> }
> }
> }
>
> cheers,
> Ardo
>
>
> On Tue, Apr 19, 2016 at 10:53 AM, nihed mbarek  wrote:
>
>> Hi,
>>
>> I want to prefix the columns of a set of DataFrames, and I have tried two
>> solutions:
>> * a for loop calling withColumnRenamed based on columns()
>> * transforming my DataFrame to an RDD, updating the old schema and
>> recreating the DataFrame.
>>
>> Both work for me; the second one is faster with tables that contain 800
>> columns, but it adds an extra transformation stage (toRDD).
>>
>> Is there any other solution?
>>
>> Thank you
>>
>> --
>>
>> M'BAREK Med Nihed,
>> Fedora Ambassador, TUNISIA, Northern Africa
>> http://www.nihed.com
>>
>> <http://tn.linkedin.com/in/nihed>
>>
>>
>


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com

<http://tn.linkedin.com/in/nihed>


prefix column Spark

2016-04-19 Thread nihed mbarek
Hi,

I want to prefix the columns of a set of DataFrames, and I have tried two
solutions:
* a for loop calling withColumnRenamed based on columns()
* transforming my DataFrame to an RDD, updating the old schema and
recreating the DataFrame.

Both work for me; the second one is faster with tables that contain 800
columns, but it adds an extra transformation stage (toRDD).

Is there any other solution?

Thank you

-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Spark Yarn closing sparkContext

2016-04-14 Thread nihed mbarek
Hi,
I have an issue when closing my application context: the process takes a
long time and fails at the end. On the other hand, my result was generated
in the output folder and the _SUCCESS file was created.
I'm using Spark 1.6 with YARN.

any idea ?

regards,

-- 

MBAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread nihed mbarek
I can't write to the hadoopConfiguration in Java.
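
A sketch of setting it from Java through the JavaSparkContext (assuming Spark
1.6; the 64 MB value and the "parquet.block.size" property string are
illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    SparkConf conf = new SparkConf().setAppName("parquet-block-size");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // the Hadoop Configuration behind the context is mutable from Java as well
    jsc.hadoopConfiguration().setLong("parquet.block.size", 64L * 1024 * 1024);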

On Friday, 8 April 2016, Silvio Fiorito wrote:

> Have you tried:
>
> sc.hadoopConfiguration.setLong(parquet.hadoop.ParquetOutputFormat.BLOCK_SIZE,
> N * 1024 * 1024)
>
> Not sure if it’d work or not, but since it’s getting it from the Hadoop
> config it should do it.
>
>
> From: nihed mbarek
> Date: Friday, April 8, 2016 at 12:01 PM
> To: User@spark.apache.org
> Subject: How to configure parquet.block.size on Spark 1.6
>
> Hi
> How to configure parquet.block.size on Spark 1.6 ?
>
> Thank you
> Nihed MBAREK
>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> <http://tn.linkedin.com/in/nihed>
>
>
>
>
>

-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com

<http://tn.linkedin.com/in/nihed>


How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread nihed mbarek
Hi
How to configure parquet.block.size on Spark 1.6 ?

Thank you
Nihed MBAREK


-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com

<http://tn.linkedin.com/in/nihed>


Join FetchFailedException

2016-04-01 Thread nihed mbarek
Hi,
I have a big DataFrame (about 100 GB) that I need to join with 3 other
DataFrames.

The first join is OK.
The second is OK.
But for the third, just after the big shuffle and before the execution of
the stage, I get an exception:

org.apache.spark.shuffle.FetchFailedException:
java.io.FileNotFoundException:
/DEVD/data/fs1/hadoop/yarn/log/usercache//appcache/application_1459412930417_0309/blockmgr-ba9d4936-1de0-4205-9913-a579b520ab1f/1a/shuffle_5_65_0.index
(No such file or directory)


Any idea ??


Thank you



-- 

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com