User class threw exception: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

2018-04-27 Thread amit kumar singh
Hi Team,

I am working on Structured Streaming.

I have added all the libraries in build.sbt, but it is still not picking up the
right library and fails with the error:

User class threw exception: java.lang.ClassNotFoundException: Failed to
find data source: kafka. Please find packages at
http://spark.apache.org/third-party-projects.html

I am using Jenkins to deploy this job.

thanks
amit
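
A hedged sketch of the usual fix (the version numbers below are assumptions, not
taken from the thread): the kafka source for Structured Streaming lives in a
separate artifact, spark-sql-kafka-0-10, which has to be on the classpath at
runtime, not only at compile time. In build.sbt that typically looks like:

// build.sbt -- sketch only; match the Spark and Scala versions of the cluster
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"

If the job is submitted as a plain jar (e.g. from Jenkins), the dependency must
either be bundled into an assembly (fat) jar or passed to spark-submit, for
example with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0.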


Re: ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Thodoris Zois
I am on CentOS 7 and I use Spark 2.3.0. Below I have posted my code. Logistic
regression took 85 minutes and linear regression 127 seconds.

My dataset, as I said, is 128 MB and contains 1000 features and ~100 classes.


# Imports (added for completeness)
import time
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Spark session
ss = SparkSession.builder.getOrCreate()

start = time.time()

# Read data (`file` holds the input path)
trainData = ss.read.format("csv").option("inferSchema", "true").load(file)

# Assemble all columns except the label (_c0) into a feature vector
assembler = VectorAssembler(inputCols=trainData.columns[1:], outputCol="features")
trainData = assembler.transform(trainData)

# Drop the original feature columns, keeping only _c0 and the assembled vector
dropColumns = [e for e in trainData.columns if e not in ('_c0', 'features')]
trainData = trainData.drop(*dropColumns)

# Rename column _c0 to label
trainData = trainData.withColumnRenamed("_c0", "label")

# Logistic regression
lr = LogisticRegression(maxIter=500, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(trainData)

# Output coefficients
print("Coefficients: " + str(lrModel.coefficientMatrix))



- Thodoris


> On 27 Apr 2018, at 22:50, Irving Duran  wrote:
> 
> Are you reformatting the data correctly for logistic regression (meaning 0s and 1s) 
> before modeling? Which OS and Spark version are you using?
> 
> Thank You,
> 
> Irving Duran
> 
> 
> On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois wrote:
> Hello,
> 
> I am running an experiment to test logistic and linear regression on spark 
> using MLlib.
> 
> My dataset is only 128 MB and something weird happens. Linear regression takes 
> about 127 seconds with either 1 or 500 iterations. On the other hand, 
> logistic regression most of the time does not manage to finish, even with 1 
> iteration. I usually get a heap memory error.
> 
> In both cases I use the default cores and memory for the driver, and I spawn 1 
> executor with 1 core and 2 GB of memory. 
> 
> Apart from that, I get a warning about NativeBLAS. I searched the Internet and 
> found that I have to install libgfortran, but even after installing it the warning 
> remains.
> 
> Any ideas for the above?
> 
> Thank you,
> - Thodoris
> 
> 
> 



export dataset in image format

2018-04-27 Thread Soheil Pourbafrani
Hi, using Spark 2.3 I read an image into a Dataset using ImageSchema. Now, after
some changes, I want to save the Dataset as a new image. How can I achieve this
in Spark?
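
A hedged sketch (not from the original thread): Spark's ImageSchema can decode
images into rows, but there is no built-in writer that turns a Dataset back into
image files, so one workaround is to rebuild each image from the row's raw bytes
and write it with javax.imageio. The field names follow the ImageSchema struct
(height, width, nChannels, data); the DataFrame name `df` and the output path
are assumptions, and the sketch assumes 3-channel images.

import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// df is the DataFrame read via ImageSchema; each row carries one decoded image
df.select("image.*").collect().zipWithIndex.foreach { case (row, i) =>
  val height    = row.getAs[Int]("height")
  val width     = row.getAs[Int]("width")
  val nChannels = row.getAs[Int]("nChannels")
  val bytes     = row.getAs[Array[Byte]]("data")      // pixels in BGR order, row-major

  // Rebuild a BufferedImage pixel by pixel and write it out as PNG
  val img = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR)
  var idx = 0
  for (y <- 0 until height; x <- 0 until width) {
    val b = bytes(idx) & 0xFF
    val g = bytes(idx + 1) & 0xFF
    val r = bytes(idx + 2) & 0xFF
    img.setRGB(x, y, (r << 16) | (g << 8) | b)
    idx += nChannels
  }
  ImageIO.write(img, "png", new File(s"/tmp/image_$i.png"))  // hypothetical output path
}

Collecting to the driver only makes sense for small image sets; for larger data
the same per-row logic could run inside foreachPartition instead.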


Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Walid Lezzar
I'm using Spark 2.3 with schema merge set to false. I don't think Spark is 
actually reading any files, but it tries to list them all one by one, and that 
is super slow on S3!

Pointing to a single partition manually is not an option, as it requires me to 
be aware of the partitioning in order to add it to the path, and also Spark 
doesn't include the partitioning column in that case.
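
One workaround worth noting here (a hedged sketch, not from the thread; the
bucket and paths are hypothetical): with the basePath read option Spark still
derives the partition column from the directory layout even when only a single
partition path is given, so the full schema can be obtained without listing the
whole table.

val df = spark.read
  .option("basePath", "s3://my-bucket/my-table")
  .parquet("s3://my-bucket/my-table/day=2018-04-27")

df.printSchema()   // includes `day` as a partition column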

> On 27 Apr 2018, at 16:07, Yong Zhang wrote:
> 
> What version of Spark are you using?
> 
> You can search "spark.sql.parquet.mergeSchema" on 
> https://spark.apache.org/docs/latest/sql-programming-guide.html
> 
> Starting from Spark 1.5, the default is already "false", which means Spark 
> shouldn't scan all the parquet files to generate the schema.
> 
> Yong
> 
> 
> 
> From: Walid LEZZAR 
> Sent: Friday, April 27, 2018 7:42 AM
> To: spark users
> Subject: How to read the schema of a partitioned dataframe without listing 
> all the partitions ?
>  
> Hi,
> 
> I have a parquet on S3 partitioned by day. I have 2 years of data (-> about 
> 1000 partitions). With Spark, when I just want to know the schema of this 
> parquet, without even asking for a single row of data, Spark tries to list all 
> the partitions and the nested partitions of the parquet, which makes it very 
> slow just to build the dataframe object in Zeppelin.
> 
> Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a 
> single partition, give me the schema of that partition and consider it as 
> the schema of the whole dataframe"? (I don't care about schema merge, it's 
> off by the way.)
> 
> Thanks.
> Walid.


Re: ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Irving Duran
Are you reformatting the data correctly for logistic regression (meaning 0s and 1s)
before modeling? Which OS and Spark version are you using?

Thank You,

Irving Duran


On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois  wrote:

> Hello,
>
> I am running an experiment to test logistic and linear regression on spark
> using MLlib.
>
> My dataset is only 128 MB and something weird happens. Linear regression
> takes about 127 seconds with either 1 or 500 iterations. On the other hand,
> logistic regression most of the time does not manage to finish, even with
> 1 iteration. I usually get a heap memory error.
>
> In both cases I use the default cores and memory for the driver, and I spawn 1
> executor with 1 core and 2 GB of memory.
>
> Apart from that, I get a warning about NativeBLAS. I searched the Internet
> and found that I have to install libgfortran, but even after installing it the
> warning remains.
>
> Any ideas for the above?
>
> Thank you,
> - Thodoris
>
>
>


ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Thodoris Zois
Hello,

I am running an experiment to test logistic and linear regression on spark 
using MLlib.

My dataset is only 128 MB and something weird happens. Linear regression takes 
about 127 seconds with either 1 or 500 iterations. On the other hand, logistic 
regression most of the time does not manage to finish, even with 1 iteration. 
I usually get a heap memory error.

In both cases I use the default cores and memory for the driver, and I spawn 1 
executor with 1 core and 2 GB of memory. 

Apart from that, I get a warning about NativeBLAS. I searched the Internet and 
found that I have to install libgfortran, but even after installing it the warning 
remains.

Any ideas for the above?

Thank you,
- Thodoris




Re: Tuning Resource Allocation during runtime

2018-04-27 Thread Vadim Semenov
You cannot dynamically change the number of cores per executor or cores
per task, but you can change the number of executors.

In one of my jobs I have something like the following: when I know that I don't
need more than 4 executors, I kill all the other executors (assuming that they
don't hold any cached data), and they join other jobs (thanks to dynamic
allocation).


// At this point we have 1500 parquet files,
// but we want 100 files, which means about 4 executors can process everything
// (assuming that they can process 30 tasks each),
// so we can let the other executors leave the job.
val executors = SparkContextUtil.getExecutorIds(sc)
executors.take(executors.size - 4).foreach(sc.killExecutor)


package org.apache.spark

/**
 * `SparkContextUtil` gives access to private[spark] methods.
 */
object SparkContextUtil {
  def getExecutorIds(sc: SparkContext): Seq[String] =
    sc.getExecutorIds.filter(_ != SparkContext.DRIVER_IDENTIFIER)
}




On Fri, Apr 27, 2018 at 3:52 AM, Donni Khan 
wrote:

> Hi All,
>
> Is there any way to change the number of executors/cores while a Spark job is
> running?
> I have a Spark job containing two tasks: the first task needs many executors to
> run fast; the second task has many input and output operations and shuffling,
> so it needs few executors, otherwise it takes a long time to finish.
> Does anyone know if that is possible on YARN?
>
>
> Thank you.
> Donni
>



-- 
Sent from my iPhone


Re: Tuning Resource Allocation during runtime

2018-04-27 Thread jogesh anand
Hi Donni,

Please check Spark dynamic allocation and the external shuffle service.
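
For reference, a minimal sketch of what that usually looks like (the values are
arbitrary placeholders): with dynamic allocation and the external shuffle
service enabled, YARN can grow the executor count for the first task and shrink
it again for the second.

val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required for dynamic allocation on YARN
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.maxExecutors", "50")

The same settings can be passed as --conf flags to spark-submit instead of being
set in code.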

On Fri, 27 Apr 2018 at 2:52 AM, Donni Khan 
wrote:

> Hi All,
>
> Is there any way to change the number of executors/cores while a Spark job is
> running?
> I have a Spark job containing two tasks: the first task needs many executors to
> run fast; the second task has many input and output operations and shuffling,
> so it needs few executors, otherwise it takes a long time to finish.
> Does anyone know if that is possible on YARN?
>
>
> Thank you.
> Donni
>
-- 

*Regards,*

*Jogesh Anand*


Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Yong Zhang
What version of Spark are you using?


You can search "spark.sql.parquet.mergeSchema" on 
https://spark.apache.org/docs/latest/sql-programming-guide.html


Starting from Spark 1.5, the default is already "false", which means Spark 
shouldn't scan all the parquet files to generate the schema.
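
A one-line illustration (the path is hypothetical): the same setting can also be
passed per read, which overrides the session-level config:

spark.read.option("mergeSchema", "false").parquet("s3://my-bucket/my-table")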


Yong






From: Walid LEZZAR 
Sent: Friday, April 27, 2018 7:42 AM
To: spark users
Subject: How to read the schema of a partitioned dataframe without listing all 
the partitions ?

Hi,

I have a parquet on S3 partitioned by day. I have 2 years of data (-> about 
1000 partitions). With Spark, when I just want to know the schema of this 
parquet, without even asking for a single row of data, Spark tries to list all 
the partitions and the nested partitions of the parquet, which makes it very 
slow just to build the dataframe object in Zeppelin.

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a 
single partition, give me the schema of that partition and consider it as 
the schema of the whole dataframe"? (I don't care about schema merge, it's off 
by the way.)

Thanks.
Walid.


Spark Streaming for more file types

2018-04-27 Thread रविशंकर नायर
All,

I have the following methods in my Scala code, currently executed on demand:

val files = sc.binaryFiles("file:///imocks/data/ocr/raw")
// The above line takes all PDF files
files.map(myconverter(_)).count


The signature of myconverter:

def myconverter(file: (String, org.apache.spark.input.PortableDataStream)): Unit = {
  // Code to interact with IBM Datamap OCR, which converts the PDF files into text
}

I want to change the above code to Spark Streaming. Unfortunately, there is no
"binaryFiles" function on StreamingContext (it would definitely be a great
addition to Spark). The closest I can think of is to write something like this:

// Assuming myconverter is not changed

val dstream = ssc.fileStream[BytesWritable, BytesWritable,
  SequenceFileAsBinaryInputFormat]("file:///imocks/data/ocr/raw")

dstream.map(myconverter(_))

Unfortunately, this is where the problems start: I get errors saying that the
method signature does not match, etc. Can anyone please help me get out of this
issue? I appreciate your help.

Also, wouldn't it be an excellent idea to have all the methods of SparkContext
reusable from StreamingContext as well? That way, it would take no extra effort
to turn a batch program into a streaming app.

Best,
Passion
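
As a hedged sketch of one workaround (the manifest directory is an assumption,
not part of the original setup): the 2.x StreamingContext has no binaryFiles
source, so one option is to stream the names of newly arrived files (e.g. a
small text manifest written by the ingestion process, one PDF path per line) and
load each micro-batch with sc.binaryFiles inside foreachRDD.

// One path per line is appended to the manifest directory by the ingestion process
val pathStream = ssc.textFileStream("file:///imocks/data/ocr/manifest")

pathStream.foreachRDD { rdd =>
  val paths = rdd.collect()                       // new file paths in this batch
  if (paths.nonEmpty) {
    // binaryFiles accepts a comma-separated list of paths
    sc.binaryFiles(paths.mkString(",")).map(myconverter(_)).count()
  }
}

(Later Spark versions add a dedicated binaryFile source to Structured Streaming,
which makes this kind of workaround unnecessary.)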


Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread ayan guha
You can specify the first folder directly and read it

On Fri, 27 Apr 2018 at 9:42 pm, Walid LEZZAR  wrote:

> Hi,
>
> I have a parquet on S3 partitioned by day. I have 2 years of data (->
> about 1000 partitions). With Spark, when I just want to know the schema of
> this parquet, without even asking for a single row of data, Spark tries to
> list all the partitions and the nested partitions of the parquet, which
> makes it very slow just to build the dataframe object in Zeppelin.
>
> Is there a way to avoid that? Is there a way to tell Spark: "hey, just
> read a single partition, give me the schema of that partition and
> consider it as the schema of the whole dataframe"? (I don't care about
> schema merge, it's off by the way.)
>
> Thanks.
> Walid.
>
-- 
Best Regards,
Ayan Guha


How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Walid LEZZAR
Hi,

I have a parquet on S3 partitioned by day. I have 2 years of data (-> about
1000 partitions). With Spark, when I just want to know the schema of this
parquet, without even asking for a single row of data, Spark tries to list
all the partitions and the nested partitions of the parquet, which makes it
very slow just to build the dataframe object in Zeppelin.

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read
a single partition, give me the schema of that partition and consider it
as the schema of the whole dataframe"? (I don't care about schema merge,
it's off by the way.)

Thanks.
Walid.


Tuning Resource Allocation during runtime

2018-04-27 Thread Donni Khan
Hi All,

Is there any way to change the number of executors/cores while a Spark job is
running?
I have a Spark job containing two tasks: the first task needs many executors to
run fast; the second task has many input and output operations and shuffling,
so it needs few executors, otherwise it takes a long time to finish.
Does anyone know if that is possible on YARN?


Thank you.
Donni