User class threw exception: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html

2018-04-27 Thread amit kumar singh
Hi Team, I am working on Structured Streaming. I have added all the libraries in build.sbt, but it's still not picking up the right library and is failing with the error: User class threw exception: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at
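This error usually means the kafka source isn't on the classpath: for Structured Streaming it ships in a separate artifact, not in spark-sql itself. A minimal build.sbt sketch, assuming Spark 2.3.0 on Scala 2.11 (adjust versions to your build):

```scala
// build.sbt — the kafka data source lives in its own artifact
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "2.3.0" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
)
```

If the job is launched with spark-submit, the artifact must also reach the cluster, e.g. via `--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0`, since a fat jar built with "provided"-style exclusions won't include it.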

Re: ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Thodoris Zois
I am on CentOS 7 and I use Spark 2.3.0. Below I have posted my code. Logistic regression took 85 minutes and linear regression 127 seconds… My dataset, as I said, is 128 MB and contains 1000 features and ~100 classes. #SparkSession ss = SparkSession.builder.getOrCreate() start = time.time()
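One plausible explanation for the gap: with ~100 classes, `LogisticRegression` fits a multinomial model (roughly 100 × 1000 coefficients) iteratively, which is far more work per pass than a least-squares fit. A minimal Scala sketch for bounding that work — the data path and parameter values are assumptions, not the poster's:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.SparkSession

// Sketch — the libsvm path and parameter values are illustrative
val spark = SparkSession.builder.getOrCreate()
val data  = spark.read.format("libsvm").load("data/train.libsvm")

val lr = new LogisticRegression()
  .setMaxIter(100)   // bound the number of optimizer iterations
  .setTol(1e-4)      // a looser tolerance stops earlier
  .setRegParam(0.01)
val model = lr.fit(data)
```

Comparing wall time with `setMaxIter(10)` vs `setMaxIter(100)` would show quickly whether the optimizer is what dominates the 85 minutes.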

export dataset in image format

2018-04-27 Thread Soheil Pourbafrani
Hi, using Spark 2.3 I read an image into a Dataset using ImageSchema. Now, after some changes, I want to save the Dataset as a new image. How can I achieve this in Spark?
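Spark 2.3's ImageSchema can read images but has no built-in writer, so one option is to collect the rows and write the pixel bytes back with plain Java imaging. A sketch assuming a 3-channel BGR image as produced by ImageSchema; the function name and the way you extract `height`/`width`/`data` from the row are illustrative:

```scala
import java.awt.image.BufferedImage
import java.io.File
import javax.imageio.ImageIO

// Rebuild a PNG from the raw BGR bytes of one ImageSchema row.
// Assumes nChannels == 3 and row-major BGR layout.
def writeImage(height: Int, width: Int, data: Array[Byte], out: File): Unit = {
  val img = new BufferedImage(width, height, BufferedImage.TYPE_3BYTE_BGR)
  var i = 0
  for (y <- 0 until height; x <- 0 until width) {
    val b = data(i) & 0xff
    val g = data(i + 1) & 0xff
    val r = data(i + 2) & 0xff
    img.setRGB(x, y, (r << 16) | (g << 8) | b)
    i += 3
  }
  ImageIO.write(img, "png", out)
}
```

For many images, the same logic can run inside `foreachPartition` so each executor writes its own files instead of collecting everything to the driver.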

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Walid Lezzar
I’m using Spark 2.3 with schema merge set to false. I don’t think Spark is actually reading any files, but it tries to list them all one by one, and it’s super slow on S3! Pointing to a single partition manually is not an option, as it requires me to be aware of the partitioning in order to add it

Re: ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Irving Duran
Are you reformatting the data correctly for logistic regression (meaning 0 & 1's) before modeling? What OS and Spark version are you using? Thank You, Irving Duran On Fri, Apr 27, 2018 at 2:34 PM Thodoris Zois wrote: > Hello, > > I am running an experiment to test logistic and

ML Linear and Logistic Regression - Poor Performance

2018-04-27 Thread Thodoris Zois
Hello, I am running an experiment to test logistic and linear regression on Spark using MLlib. My dataset is only 128 MB and something weird happens. Linear regression takes about 127 seconds, either with 1 or 500 iterations. On the other hand, logistic regression most of the times does not

Re: Tuning Resource Allocation during runtime

2018-04-27 Thread Vadim Semenov
You cannot dynamically change the number of cores per executor or cores per task, but you can change the number of executors. In one of my jobs I have something like this, so when I know that I don't need more than 4 executors, I kill all the other executors (assuming that they don't hold any cached
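The developer API this refers to looks roughly like the sketch below (it requires dynamic allocation to be enabled; the executor IDs are illustrative — in practice they come from the Spark UI or a listener):

```scala
// Sketch of SparkContext's runtime-resizing developer API
val sc = spark.sparkContext

// Ask the cluster manager to settle at 4 executors total
sc.requestTotalExecutors(numExecutors = 4, localityAwareTasks = 0,
  hostToLocalTaskCount = Map.empty)

// Explicitly kill the executors no longer needed (IDs are illustrative)
sc.killExecutors(Seq("5", "6", "7", "8"))
```

Both calls are advisory: the cluster manager honors them asynchronously, and killed executors lose any cached blocks they held.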

Re: Tuning Resource Allocation during runtime

2018-04-27 Thread jogesh anand
Hi Donni, Please check Spark dynamic allocation and the external shuffle service. On Fri, 27 Apr 2018 at 2:52 AM, Donni Khan wrote: > Hi All, > > Is there any way to change the number of executors/cores during a running > Spark job. > I have a Spark job containing two
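The settings involved, as a spark-defaults.conf sketch (the limit and timeout values are illustrative):

```
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true
spark.dynamicAllocation.minExecutors         2
spark.dynamicAllocation.maxExecutors         50
spark.dynamicAllocation.executorIdleTimeout  60s
```

With these set, Spark requests executors while tasks are pending and releases idle ones automatically, which covers the "many executors for stage one, few for stage two" pattern without manual intervention; the external shuffle service keeps shuffle files available after an executor is removed.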

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Yong Zhang
What version of Spark are you using? You can search for "spark.sql.parquet.mergeSchema" on https://spark.apache.org/docs/latest/sql-programming-guide.html Starting from Spark 1.5, the default is already "false", which means Spark shouldn't scan all the parquet files to generate the schema.
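For reference, schema merging can be controlled either per read or session-wide; a sketch (the S3 path is a placeholder):

```scala
// Per-read: overrides the session default for this one load
val df = spark.read
  .option("mergeSchema", "false")
  .parquet("s3://bucket/table")

// Session-wide default (already false since Spark 1.5)
spark.conf.set("spark.sql.parquet.mergeSchema", "false")
```

Note that even with merging off, partition *discovery* still lists directories, which is the slow part on S3 that this thread is about.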

Spark Streaming for more file types

2018-04-27 Thread रविशंकर नायर
All, I have the following methods in my Scala code, currently executed on demand: val files = sc.binaryFiles("file:///imocks/data/ocr/raw") // Above line takes all PDF files files.map(myconverter(_)).count myconverter signature: def myconverter(file: (String,
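There is no streaming counterpart to `sc.binaryFiles` in Spark 2.x, but if upgrading is an option, Spark 3.0 added a `binaryFile` data source that also works with `readStream`. A sketch reusing the poster's path (the glob option and downstream handling are assumptions):

```scala
// Sketch — requires Spark 3.0+; picks up new PDFs as they arrive
val pdfs = spark.readStream
  .format("binaryFile")
  .option("pathGlobFilter", "*.pdf")
  .load("file:///imocks/data/ocr/raw")

// Each row carries path, modificationTime, length and content (bytes),
// so the existing converter logic can run against the "content" column
```

On Spark 2.x, a common workaround is simply to rerun the batch `binaryFiles` job on a schedule, tracking already-processed paths.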

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread ayan guha
You can specify the first folder directly and read it On Fri, 27 Apr 2018 at 9:42 pm, Walid LEZZAR wrote: > Hi, > > I have a parquet on S3 partitioned by day. I have 2 years of data (-> > about 1000 partitions). With spark, when I just want to know the schema of > this
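Reading a single partition directly works, and setting `basePath` keeps the partition column in the schema; a sketch (bucket, table name and day value are placeholders):

```scala
// Listing only one day's folder avoids the full S3 listing;
// basePath tells Spark where partition discovery starts, so the
// "day" column still appears in the schema
val df = spark.read
  .option("basePath", "s3://bucket/events")
  .parquet("s3://bucket/events/day=2018-04-27")
df.printSchema()
```

The trade-off raised later in the thread still applies: you need to know at least one concrete partition value up front.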

How to read the schema of a partitioned dataframe without listing all the partitions ?

2018-04-27 Thread Walid LEZZAR
Hi, I have a parquet table on S3 partitioned by day. I have 2 years of data (-> about 1000 partitions). With Spark, when I just want to know the schema of this parquet without even asking for a single row of data, Spark tries to list all the partitions and the nested partitions of the parquet. Which

Tuning Resource Allocation during runtime

2018-04-27 Thread Donni Khan
Hi All, Is there any way to change the number of executors/cores while a Spark job is running? I have a Spark job containing two tasks: the first task needs many executors to run fast; the second task has many input and output operations and shuffling, so it needs few executors, otherwise it takes long