SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Ajay Srivastava
Hi, this JIRA, https://issues.apache.org/jira/browse/SPARK-8813, is fixed in Spark 2.0, but the resolution is not mentioned there. In our use case, there are big as well as many small Parquet files being queried using Spark SQL. Can someone please explain what the fix is and how I can use it?
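For context, the Spark 2.0 file sources resolve this by packing many small files into a single read task. A minimal sketch, assuming the Spark 2.0 settings spark.sql.files.maxPartitionBytes and spark.sql.files.openCostInBytes (values and the input path here are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("small-file-packing")
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)  // max bytes per read partition
      .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)      // cost charged per opened file
      .getOrCreate()

    // Many small Parquet files under this (hypothetical) path are grouped
    // into fewer partitions, so fewer tasks are launched.
    val df = spark.read.parquet("/data/events")
    println(df.rdd.getNumPartitions)

A higher open cost packs more small files into each partition, at the risk of under-parallelizing large files.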

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
… is enabled for a particular stage. spark.speculation.multiplier (default 1.5): how many times slower a task must be than the median to be considered for speculation. On Thursday, January 15, 2015 5:44 AM, Ajay Srivastava a_k_srivast...@yahoo.com.INVALID wrote: Hi, My spark job is taking long …
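A minimal sketch of turning speculation on with the settings quoted above (values are the documented defaults, aside from spark.speculation itself, which is off by default):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("speculation-demo")
      .set("spark.speculation", "true")             // re-launch straggler tasks on other executors
      .set("spark.speculation.multiplier", "1.5")   // straggler = 1.5x slower than the median task
      .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish before checking
    val sc = new SparkContext(conf)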

Re: Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
… http://spark.apache.org/docs/latest/tuning.html#serialized-rdd-storage Cheers, - Nicos. On Jan 15, 2015, at 6:49 AM, Ajay Srivastava a_k_srivast...@yahoo.com.INVALID wrote: Thanks, RK. I can turn on speculative execution, but I am trying to find out the actual reason for the delay, as it happens on any node. Any idea about …
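A minimal sketch of the serialized RDD storage the linked tuning guide describes; sc is an existing SparkContext and the input path is illustrative:

    import org.apache.spark.storage.StorageLevel

    // Storing partitions as serialized byte arrays trades some CPU for
    // much lower GC pressure than plain MEMORY_ONLY.
    val cached = sc.textFile("/data/input").persist(StorageLevel.MEMORY_ONLY_SER)
    cached.count()  // materializes the serialized cache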

Some tasks are taking long time

2015-01-15 Thread Ajay Srivastava
Hi, my Spark job is taking a long time. I see that some tasks take longer for the same amount of data and shuffle read/write. What could be the possible reasons for this? The thread dump sometimes shows that all the tasks in an executor are waiting with the following stack trace: Executor task …

Re: Creating RDD from only few columns of a Parquet file

2015-01-13 Thread Ajay Srivastava
Setting spark.sql.hive.convertMetastoreParquet to true has fixed this. Regards, Ajay. On Tuesday, January 13, 2015 11:50 AM, Ajay Srivastava a_k_srivast...@yahoo.com.INVALID wrote: Hi, I am trying to read a Parquet file using: val parquetFile = sqlContext.parquetFile("people.parquet") …
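A minimal sketch of applying that setting in a Spark 1.x HiveContext; the app name is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("parquet-pruning"))
    val sqlContext = new HiveContext(sc)
    // Use Spark's native Parquet support (which can prune columns) instead
    // of the Hive SerDe when reading metastore Parquet tables.
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")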

Creating RDD from only few columns of a Parquet file

2015-01-12 Thread Ajay Srivastava
Hi, I am trying to read a Parquet file using: val parquetFile = sqlContext.parquetFile("people.parquet"). There is no way to specify that I am interested in reading only some columns from disk. For example, the Parquet file may have 10 columns while I want to read only 3 of them from disk. We have done …
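A minimal sketch of column pruning with the Spark 1.x API used above; once the native Parquet reader applies, only the projected columns are read from disk. The table and column names are made up:

    // sqlContext is an existing SQLContext.
    val parquetFile = sqlContext.parquetFile("people.parquet")
    parquetFile.registerTempTable("people")
    // Projecting a subset of columns lets the Parquet reader skip the rest.
    val subset = sqlContext.sql("SELECT name, age FROM people")
    subset.count()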

Spark summit 2014 videos ?

2014-07-10 Thread Ajay Srivastava
Hi, I did not find any videos on the Apache Spark channel on YouTube yet. Any idea when these will be made available? Regards, Ajay

OFF_HEAP storage level

2014-07-04 Thread Ajay Srivastava
Hi, I was checking the different storage levels of an RDD and found OFF_HEAP. Has anybody used this level? If I use it, where will the data be stored? If it is not in the heap, does that mean we can avoid GC? How can I use this level? I did not find anything in the archive regarding this. Can someone …
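A minimal sketch of using OFF_HEAP, assuming a Spark 1.x setup where off-heap blocks were written to Tachyon (configured via spark.tachyonStore.url) and therefore live outside the JVM heap, beyond the GC's reach; sc is an existing SparkContext:

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.OFF_HEAP)
    rdd.count()  // materializes the RDD into off-heap storage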

Re: Join : Giving incorrect result

2014-06-06 Thread Ajay Srivastava
… a patch for it here: https://github.com/apache/spark/pull/986. Feel free to try that if you’d like; it will also be in 0.9.2 and 1.0.1. Matei. On Jun 5, 2014, at 12:19 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote: Sorry for replying late. It was night here. Lian/Matei, here is the code …

Re: Join : Giving incorrect result

2014-06-05 Thread Ajay Srivastava
On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote: Maybe your two workers have different assembly jar files? I just ran into a similar problem where my spark-shell was using a different jar file than my workers, and I got really confusing results. On Jun 4, 2014 8:33 AM, Ajay …

Join : Giving incorrect result

2014-06-04 Thread Ajay Srivastava
Hi, I am doing a join of two RDDs which gives different results (counting the number of records) each time I run this code on the same input. The input files are large enough to be divided into two splits. When the program runs on two workers with a single core assigned to each, the output is consistent …
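A minimal sketch of the join-and-count pattern being described; sc is an existing SparkContext, and the paths and key layout are made up:

    val left  = sc.textFile("/data/left").map(line => (line.split(",")(0), line))
    val right = sc.textFile("/data/right").map(line => (line.split(",")(0), line))
    // On the same input, this count should be stable across runs.
    println(left.join(right).count())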