Help explaining explain() after DataFrame join reordering

2018-06-01 Thread Mohamed Nadjib MAMI
...include the new cost-based optimizations (introduced in Spark 2.2).
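
For reference, a minimal sketch of enabling the cost-based optimizer and inspecting the resulting join order, assuming Spark 2.2+ in the spark-shell; the table and column names are hypothetical:

    // Enable CBO and its join reordering (both are off by default in Spark 2.2).
    spark.conf.set("spark.sql.cbo.enabled", "true")
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")

    // Collect table and column statistics so the optimizer has cardinality estimates.
    spark.sql("ANALYZE TABLE orders COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE customers COMPUTE STATISTICS FOR COLUMNS id")

    // The extended plan shows whether the joins were reordered.
    spark.sql("SELECT * FROM orders o JOIN customers c ON o.cust_id = c.id").explain(true)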

Re: df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Mohamed Nadjib MAMI
That was the case. Thanks for the quick and clean answer, Hemanth.

df.count() returns one more count than SELECT COUNT()

2017-04-06 Thread Mohamed Nadjib Mami
I paste this right from the Spark shell (Spark 2.1.0):

    scala> spark.sql("SELECT count(distinct col) FROM Table").show()
    +-------------------+
    |count(DISTINCT col)|
    +-------------------+
    |               4697|
    +-------------------+

    scala>
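
One plausible cause of an off-by-one like this (a hedged guess, not confirmed by the truncated preview): df.count() counts every row, while count(DISTINCT col) ignores NULLs and collapses duplicates. A minimal sketch in the spark-shell, with an illustrative column value set:

    import spark.implicits._

    // A NULL in `col` is counted by df.count() but not by count(DISTINCT col).
    val df = Seq(Some("a"), Some("b"), None).toDF("col")
    df.count()                                                   // 3
    df.createOrReplaceTempView("Table")
    spark.sql("SELECT count(DISTINCT col) FROM Table").show()    // 2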

SparkSQL: intra-SparkSQL-application table registration

2016-11-14 Thread Mohamed Nadjib Mami
Hello, I've asked the following question [1] on Stack Overflow but haven't gotten an answer yet. I'm now using this channel to give it more visibility, and hopefully find someone who can help. "Context. I have tens of SQL queries stored in separate files. For benchmarking purposes, I created an
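
A hedged sketch of one way to set this up (Scala; paths, table names, and file layout are hypothetical): register the source data as temporary tables once, then loop over the query files inside the same application, since temp-table registrations are only visible within the SQLContext that created them:

    import scala.io.Source

    // Register the inputs once for the whole benchmark run.
    sqlContext.read.parquet("/data/lineitem.parquet").registerTempTable("lineitem")
    sqlContext.read.parquet("/data/orders.parquet").registerTempTable("orders")

    // Run every query file against the same registrations and time it.
    val queryFiles = new java.io.File("/benchmark/queries").listFiles
      .filter(_.getName.endsWith(".sql")).sortBy(_.getName)
    for (f <- queryFiles) {
      val query = Source.fromFile(f).mkString
      val start = System.nanoTime()
      sqlContext.sql(query).collect()   // force execution
      println(s"${f.getName}: ${(System.nanoTime() - start) / 1e6} ms")
    }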

Java.util.ArrayList is not a valid external type for schema of array

2016-10-13 Thread Mohamed Nadjib MAMI
...in Parquet tables. Any help on solving or working around this would be much appreciated.
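
One frequent trigger of this error (a hedged guess, since the message is truncated): building Row objects that carry a java.util.ArrayList for a field declared as ArrayType, where Spark's row converters expect a Scala Seq or an array. A minimal sketch of the conversion, with illustrative names and paths:

    import scala.collection.JavaConverters._
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(StructField("tags", ArrayType(StringType), nullable = true)))

    val javaList: java.util.List[String] = java.util.Arrays.asList("a", "b")
    val row = Row(javaList.asScala.toSeq)   // a Seq is a valid external type; an ArrayList is not

    val df = spark.createDataFrame(sc.parallelize(Seq(row)), schema)
    df.write.mode("append").parquet("/tmp/tags.parquet")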

SQL query results: what is being cached?

2016-04-06 Thread Mohamed Nadjib MAMI
I noticed that for most SQL queries (sqlContext.sql(query)) I ran on Parquet tables, some results are returned faster after the first and second run of the query. Is this variation normal, i.e. can two executions of the same job take different times? Or are there some intermediate results
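
A hedged note: some run-to-run variation is normal. Spark does not keep query results between runs unless caching is requested explicitly; faster later runs typically come from OS file-system caching of the Parquet files and JVM warm-up. Explicit caching, if desired, looks like this (table and query are illustrative):

    sqlContext.cacheTable("myTable")   // lazy: materialized by the first query that scans the table
    sqlContext.sql("SELECT col, count(*) FROM myTable GROUP BY col").collect()  // pays the caching cost
    sqlContext.sql("SELECT col, count(*) FROM myTable GROUP BY col").collect()  // served from the in-memory cache
    sqlContext.uncacheTable("myTable")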

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
...2:16, Ted Yu wrote: Can you obtain output from explain(true) on the query after the cacheTable() call? Potentially related JIRA: [SPARK-13657][SQL] Support parsing very long AND/OR expressions. On Thu, Mar 24, 2016 at 12:55 PM, Mohamed Nadjib MAMI <m...@iai.uni-bonn.de>
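
For reference, a minimal sketch of producing that extended plan in the spark-shell (table and predicate are illustrative):

    sqlContext.cacheTable("myTable")
    // explain(true) prints the parsed, analyzed, optimized and physical plans.
    sqlContext.sql("SELECT * FROM myTable WHERE a = 1 OR a = 2").explain(true)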

Re: Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
...If you can show a snippet of your code, that would help give us more clues. Thanks. On Mar 24, 2016, at 2:43 AM, Mohamed Nadjib MAMI <m...@iai.uni-bonn.de> wrote: Hi all, I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a problem with table caching (sqlCont

Spark SQL - java.lang.StackOverflowError after caching table

2016-03-24 Thread Mohamed Nadjib MAMI
Hi all, I'm running SQL queries (sqlContext.sql()) on Parquet tables and facing a problem with table caching (sqlContext.cacheTable()), using the spark-shell of Spark 1.5.1. After I run sqlContext.cacheTable(table), sqlContext.sql(query) takes longer the first time (well, for the lazy
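
A hedged sketch of the usual workaround for the slow first query: cacheTable() in Spark 1.5 is lazy, so the in-memory columnar cache is built by the first action that scans the table; forcing a cheap scan right after caching moves that cost out of the measured queries (table and column names are illustrative):

    sqlContext.cacheTable("myTable")
    sqlContext.sql("SELECT COUNT(*) FROM myTable").collect()            // materializes the cache once
    sqlContext.sql("SELECT col1 FROM myTable WHERE col2 = 'x'").show()  // now served from memory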

Parition RDD by key to create DataFrames

2016-03-15 Thread Mohamed Nadjib MAMI
Hi, I have a pair RDD of the form: (mykey, (value1, value2)). How can I create a DataFrame with the schema [V1 String, V2 String] to store [value1, value2] and save it into a Parquet table named "mykey"? The createDataFrame() method takes an RDD and a schema (StructType) as parameters. The
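
A hedged sketch of one way to do this with the Spark 1.x API (all names are illustrative; it assumes the set of distinct keys is small enough to iterate over on the driver):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    val pairs = sc.parallelize(Seq(("mykey", ("a", "b")), ("otherkey", ("c", "d"))))

    val schema = StructType(Seq(
      StructField("V1", StringType, nullable = true),
      StructField("V2", StringType, nullable = true)))

    // One Parquet table (directory) per key, named after the key.
    for (k <- pairs.keys.distinct().collect()) {
      val rows = pairs.filter(_._1 == k).values.map { case (v1, v2) => Row(v1, v2) }
      sqlContext.createDataFrame(rows, schema).write.parquet(s"/warehouse/$k")
    }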

ErrorToken illegal character in a query having / @ $ . symbols

2016-02-08 Thread Mohamed Nadjib MAMI
Hello all, could someone please help me figure out what's wrong with my query that I'm running over Parquet tables? The query has the following form: weird_query = "SELECT a._example.com/aa/1.1/aa_, b._example.com/bb/1.2/bb_ FROM www$aa@aa a LEFT JOIN www$bb@bb b ON
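
A hedged guess at the cause: in Spark SQL, identifiers containing characters such as '.', '/', '@' or '$' generally need to be quoted with backticks, otherwise the lexer reports them as illegal tokens. Assuming the surrounding underscores in the preview are italics residue and the real column names are example.com/aa/1.1/aa and so on, the quoted form would look roughly like this (the join condition is made up for illustration):

    val weirdQuery =
      """SELECT a.`example.com/aa/1.1/aa`, b.`example.com/bb/1.2/bb`
        |FROM `www$aa@aa` a LEFT JOIN `www$bb@bb` b
        |ON a.`example.com/aa/1.1/aa` = b.`example.com/bb/1.2/bb`""".stripMargin
    sqlContext.sql(weirdQuery)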

Too many open files, why changing ulimit not effecting?

2016-02-05 Thread Mohamed Nadjib MAMI
Hello all, I'm getting the famous java.io.FileNotFoundException: ... (Too many open files) exception. What seemed to help other people out hasn't worked for me. I tried to set the ulimit via the command line ("ulimit -n"), then I tried to add the following lines to

Re: Spark on YARN: java.lang.ClassCastException SerializedLambda to org.apache.spark.api.java.function.Function in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1

2015-12-06 Thread Mohamed Nadjib Mami
Your jars are not delivered to the workers. Have a look at this: http://stackoverflow.com/questions/24052899/how-to-make-it-easier-to-deploy-my-jar-to-spark-cluster-in-standalone-mode
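
A hedged sketch of the usual remedies behind that answer: either ship the application jar explicitly via SparkConf.setJars, or submit the job with spark-submit (which distributes the jar for you). Paths and names are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .setMaster("yarn-client")                      // Spark 1.x style YARN client mode
      .setJars(Seq("/path/to/my-app-assembly.jar"))  // jar containing the functions/lambdas
    val sc = new SparkContext(conf)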

Filling Parquet files by values in Value of a JavaPairRDD

2015-06-06 Thread Mohamed Nadjib Mami
Hello Sparkers, I'm reading data from a CSV file, applying some transformations, and ending up with an RDD of pairs (String, Iterable). I have already prepared Parquet files. I now want to take the previous (key, value) RDD and populate the Parquet files as follows: - key holds the name of the
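
A hedged alternative sketch (Scala rather than the Java API used in the thread, and assuming Spark 1.4+): put the key into a column and let DataFrameWriter.partitionBy create one Parquet directory per key; all names and paths are illustrative:

    import sqlContext.implicits._

    // (key, values) pairs, flattened to one row per element.
    val pairs = sc.parallelize(Seq(("fileA", Seq("x", "y")), ("fileB", Seq("z"))))
    val df = pairs.flatMap { case (k, vs) => vs.map(v => (k, v)) }.toDF("key", "value")

    df.write.partitionBy("key").parquet("/warehouse/output")   // one sub-directory per key value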