to_avro/from_avro inserts extra values from Kafka

2020-05-12 Thread Alex Nastetsky
Hi all, I create a dataframe, convert it to Avro with to_avro and write it to Kafka. Then I read it back out with from_avro. (Not using Schema Registry.) The problem is that the values skip every other field in the result. I expect: +---------+--------+-----+ |firstName|lastName|color|
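
A minimal round-trip sketch of what this thread describes (Kafka hop omitted), assuming Spark 2.4+ with the spark-avro module and a hypothetical three-field record; the read-side JSON schema has to line up field-for-field with the struct that was written:

    import org.apache.spark.sql.functions.{col, struct}
    import org.apache.spark.sql.avro.functions.{to_avro, from_avro} // org.apache.spark.sql.avro._ in Spark 2.4
    import spark.implicits._

    val df = Seq(("John", "Smith", "blue")).toDF("firstName", "lastName", "color")

    // Pack the whole row into one Avro binary column, as the Kafka "value" expects.
    val toKafka = df.select(to_avro(struct(df.columns.map(col): _*)).as("value"))

    // JSON Avro schema used on the read side.
    val avroSchema =
      """{"type":"record","name":"Person","fields":[
        |{"name":"firstName","type":"string"},
        |{"name":"lastName","type":"string"},
        |{"name":"color","type":"string"}]}""".stripMargin

    val roundTripped = toKafka.select(from_avro($"value", avroSchema).as("rec")).select("rec.*")
    roundTripped.show()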

"where" clause able to access fields not in its schema

2019-02-13 Thread Alex Nastetsky
I don't know if this is a bug or a feature, but it's a bit counter-intuitive when reading code. The "b" dataframe does not have field "bar" in its schema, but is still able to filter on that field. scala> val a = sc.parallelize(Seq((1,10),(2,20))).toDF("foo","bar") a:
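
A small sketch reproducing the behavior described, as typed in spark-shell (exact results may vary by Spark version): the projected DataFrame no longer lists "bar" in its schema, yet the filter still resolves it through the underlying plan.

    val a = sc.parallelize(Seq((1, 10), (2, 20))).toDF("foo", "bar")
    val b = a.select("foo")        // b.printSchema() shows only "foo"

    // Counter-intuitively, this analyzes and runs: the missing column is
    // resolved against a's plan and projected back out afterwards.
    b.where($"bar" > 10).show()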

Re: partitionBy with partitioned column in output?

2018-02-26 Thread Alex Nastetsky
it. On Mon, Feb 26, 2018 at 5:47 PM, naresh Goud <nareshgoud.du...@gmail.com> wrote: > is this helps? > > sc.parallelize(List((1,10),(2,20))).toDF("foo","bar").map(("foo","bar")=>("foo",("foo","bar"))

partitionBy with partitioned column in output?

2018-02-26 Thread Alex Nastetsky
Is there a way to make outputs created with "partitionBy" to contain the partitioned column? When reading the output with Spark or Hive or similar, it's less of an issue because those tools know how to perform partition discovery. But if I were to load the output into an external data warehouse or
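
One commonly suggested workaround (a sketch, not from this thread; df, outputPath, and the "date" column are assumed) is to duplicate the partition column so one copy lands in the directory names and the other survives inside the files:

    import org.apache.spark.sql.functions.col

    // "date" stays inside the Parquet files; "date_part" becomes the
    // directory layout (date_part=.../) that partition discovery reads.
    df.withColumn("date_part", col("date"))
      .write
      .partitionBy("date_part")
      .parquet(outputPath)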

Dataset API inconsistencies

2018-01-09 Thread Alex Nastetsky
I am finding the Dataset API very cumbersome to use, which is unfortunate, as I was looking forward to the type-safety after coming from a Dataframe codebase. This link summarizes my troubles: http://loicdescotte.github.io/posts/spark2-datasets-type-safety/ The problem is having to
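
A small sketch (with an assumed case class) of the kind of friction the linked post describes: many operations silently drop from a typed Dataset back to an untyped DataFrame of Rows.

    import org.apache.spark.sql.functions.sum
    import spark.implicits._

    case class Person(name: String, age: Int)
    val ds = Seq(Person("a", 1), Person("b", 2)).toDS()

    ds.map(_.age + 1)                        // stays typed: Dataset[Int]
    ds.select($"age")                        // untyped: DataFrame (Dataset[Row])
    ds.groupBy($"name").agg(sum($"age"))     // also untyped; compile-time safety is gone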

Re: spark sql aggregate function "Nth"

2016-07-26 Thread Alex Nastetsky
d. In your example, what happens if data is of only 2 rows? > On 27 Jul 2016 00:57, "Alex Nastetsky" <alex.nastet...@vervemobile.com> > wrote: > >> Spark SQL has a "first" function that returns the first item in a group. >> Is there a similar function

spark sql aggregate function "Nth"

2016-07-26 Thread Alex Nastetsky
Spark SQL has a "first" function that returns the first item in a group. Is there a similar function, perhaps in a third-party lib, that allows you to return an arbitrary (e.g. 3rd) item from the group? Was thinking of writing a UDAF for it, but didn't want to reinvent the wheel. My end goal is to
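
One way to get an arbitrary item per group without a custom UDAF (a sketch with assumed column names, not necessarily what the thread settled on) is a window function plus row_number:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    // Rank the rows inside each group and keep the 3rd one.
    val w = Window.partitionBy(col("key")).orderBy(col("ts"))
    val third = df.withColumn("rn", row_number().over(w))
                  .where(col("rn") === 3)
                  .drop("rn")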

Databricks Cloud vs AWS EMR

2016-01-26 Thread Alex Nastetsky
As a user of AWS EMR (running Spark and MapReduce), I am interested in potential benefits that I may gain from Databricks Cloud. I was wondering if anyone has used both and done comparison / contrast between the two services. In general, which resource manager(s) does Databricks Cloud use for

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Ran into this need myself. Does Spark have an equivalent of "mapreduce.input.fileinputformat.list-status.num-threads"? Thanks. On Thu, Jul 23, 2015 at 8:50 PM, Cheolsoo Park wrote: > Hi, > > I am wondering if anyone has successfully enabled >
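
For reference, a sketch of how the property would normally be passed through Spark's Hadoop configuration; whether Spark's own file listing honors it is exactly what this thread (and SPARK-9926 below) is about.

    // Hadoop input formats read this setting from the job configuration.
    sc.hadoopConfiguration.setInt(
      "mapreduce.input.fileinputformat.list-status.num-threads", 20)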

Re: Enabling mapreduce.input.fileinputformat.list-status.num-threads in Spark?

2016-01-12 Thread Alex Nastetsky
Alex, see this jira- > https://issues.apache.org/jira/browse/SPARK-9926 > > On Tue, Jan 12, 2016 at 10:55 AM, Alex Nastetsky < > alex.nastet...@vervemobile.com> wrote: > >> Ran into this need myself. Does Spark have an equivalent of "mapreduce.input.fileinp

Spark SQL UDAF works fine locally, OutOfMemory on YARN

2015-11-16 Thread Alex Nastetsky
Hi, I am using Spark 1.5.1. I have a Spark SQL UDAF that works fine on a tiny dataset (13 narrow rows) in local mode, but runs out of memory on YARN about half the time (OutOfMemory: Java Heap Space). The rest of the time, it works on YARN. Note that in all instances, the input data is the

Re: Spark 1.5 UDAF ArrayType

2015-11-10 Thread Alex Nastetsky
Hi, I believe I ran into the same bug in 1.5.0, although my error looks like this: Caused by: java.lang.ClassCastException: [Lcom.verve.spark.sql.ElementWithCount; cannot be cast to org.apache.spark.sql.types.ArrayData at

Re: Sort Merge Join

2015-11-02 Thread Alex Nastetsky
with the identical join keys will be loaded by the > same node/task , since lots of factors need to be considered, like task > pool size, cluster size, source format, storage, data locality etc.,. > > I’ll agree it’s worth to optimize it for performance concerns, and > actually in Hive

Sort Merge Join

2015-11-01 Thread Alex Nastetsky
Hi, I'm trying to understand SortMergeJoin (SPARK-2213). 1) Once SortMergeJoin is enabled, will it ever use ShuffledHashJoin? For example, in the code below, the two datasets have different number of partitions, but it still does a SortMerge join after a "hashpartitioning". CODE: val
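
A sketch (assumed tables, Spark 1.5-era configuration) of checking which join the planner actually picks:

    // In Spark 1.5/1.6 sort-merge join is toggled by this flag (default true).
    sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")

    val joined = df1.join(df2, "key")
    // Look for SortMergeJoin vs ShuffledHashJoin, and the hashpartitioning
    // exchanges feeding it, in the physical plan.
    joined.explain()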

foreachPartition

2015-10-30 Thread Alex Nastetsky
I'm just trying to do some operation inside foreachPartition, but I can't even get a simple println to work. Nothing gets printed. scala> val a = sc.parallelize(List(1,2,3)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:21 scala> a.foreachPartition(p =>
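
A sketch of the usual explanation in these threads: the closure runs where the partitions live (on executors in a cluster), so println output lands in executor stdout/logs rather than necessarily on the driver console.

    val a = sc.parallelize(List(1, 2, 3))

    // Runs on the executors; check executor stdout via the web UI / logs.
    a.foreachPartition(part => part.foreach(x => println(x)))

    // Bringing the data back first makes the println happen on the driver.
    a.collect().foreach(println)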

Re: foreachPartition

2015-10-30 Thread Alex Nastetsky
> On Fri, Oct 30, 2015 at 4:42 PM, Alex Nastetsky < > alex.nastet...@vervemobile.com> wrote: > >> I'm just trying to do some operation inside foreachPartition, but I can't >> even get a simple println to work. Nothing gets printed. >> >> scala> val a = sc.pa

CompositeInputFormat in Spark

2015-10-30 Thread Alex Nastetsky
Does Spark have an implementation similar to CompositeInputFormat in MapReduce? CompositeInputFormat joins multiple datasets prior to the mapper, that are partitioned the same way with the same number of partitions, using the "part" number in the file name in each dataset to figure out which file
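
One commonly suggested analogue (a sketch; rdd1 and rdd2 are assumed key-value RDDs) is to co-partition both sides with the same partitioner so the join has narrow dependencies and needs no shuffle at join time. Unlike CompositeInputFormat, this requires the partitioning step inside the same job, since partitioner information is not recovered from files already laid out on disk.

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(8)
    val left  = rdd1.partitionBy(p).persist()
    val right = rdd2.partitionBy(p).persist()

    // Both sides share a partitioner, so the join is a narrow, map-side-style join.
    val joined = left.join(right)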

Re: writing avro parquet

2015-10-19 Thread Alex Nastetsky
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to use it on myDF.rdd instead of converting it to a PairRDD first. On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > Using Spark 1.5.1, Parquet 1.7.0. > > I'm tryin
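
A sketch of the fix described above, with assumptions spelled out: MyClass is the Avro-generated class, rowToAvro is a hypothetical Row-to-MyClass converter, and the Parquet output format ignores the key, so a Void key is enough to make the RDD a pair RDD.

    import org.apache.parquet.avro.AvroWriteSupport
    import org.apache.parquet.hadoop.ParquetOutputFormat

    sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS,
      classOf[AvroWriteSupport].getName)
    AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)

    myDF.rdd
      .map(row => (null: Void, rowToAvro(row)))   // saveAsNewAPIHadoopFile needs a pair RDD
      .saveAsNewAPIHadoopFile(
        outputPath,
        classOf[Void],
        classOf[MyClass],
        classOf[ParquetOutputFormat[MyClass]],
        sc.hadoopConfiguration)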

writing avro parquet

2015-10-19 Thread Alex Nastetsky
Using Spark 1.5.1, Parquet 1.7.0. I'm trying to write Avro/Parquet files. I have this code: sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS, classOf[AvroWriteSupport].getName) AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$) myDF.write.parquet(outputPath)

spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
I save my dataframe to avro with spark-avro 1.0.0 and it looks like this (using avro-tools tojson): {"field1":"value1","field2":976200} {"field1":"value2","field2":976200} {"field1":"value3","field2":614100} But when I use spark-avro 2.0.1, it looks like this:
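
For context, a sketch of the write path being compared; the format name below is the databricks spark-avro data source of that era, assumed to be the same for both versions.

    df.write.format("com.databricks.spark.avro").save(outputPath)
    // Then inspect the output with: avro-tools tojson part-*.avro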

Re: spark-avro 2.0.1 generates strange schema (spark-avro 1.0.0 is fine)

2015-10-14 Thread Alex Nastetsky
> On Wed, Oct 14, 2015 at 1:38 PM, Alex Nastetsky < > alex.nastet...@vervemobile.com> wrote: > >> I save my dataframe to avro with spark-avro 1.0.0 and it looks like this >> (using avro-tools tojson): >> >> {"field1":"value1","field2"

dataframes and numPartitions

2015-10-14 Thread Alex Nastetsky
A lot of RDD methods take a numPartitions parameter that lets you specify the number of partitions in the result. For example, groupByKey. The DataFrame counterparts don't have a numPartitions parameter, e.g. groupBy only takes a bunch of Columns as params. I understand that the DataFrame API is
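
A sketch of the knobs that play the numPartitions role on the DataFrame side (1.5-era API, names as commonly used):

    // Number of partitions produced by shuffles such as groupBy/join:
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")

    val counts = df.groupBy("key").count()
    val wider  = counts.repartition(50)   // explicit full shuffle to 50 partitions
    val fewer  = counts.coalesce(10)      // shrink partition count without a full shuffle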

dataframe json schema scan

2015-08-20 Thread Alex Nastetsky
The doc for DataFrameReader#json(RDD[String]) method says Unless the schema is specified using schema function, this function goes through the input once to determine the input schema. https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader Why is this
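
A sketch of the alternative the doc alludes to: supplying the schema up front (field names assumed, jsonRDD being an RDD[String]) so no extra pass over the input is needed for inference.

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("field1", StringType),
      StructField("field2", LongType)))

    // With an explicit schema, the reader skips the schema-inference scan.
    val df = sqlContext.read.schema(schema).json(jsonRDD)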

Re: restart from last successful stage

2015-07-29 Thread Alex Nastetsky
cause recomputing all the stages. Consider running the job by generating the RDD from Dataframe and then using that. Of course, you can use caching in both core and DataFrames, which will solve all these concerns. On Tue, Jul 28, 2015 at 1:03 PM, Alex Nastetsky alex.nastet

restart from last successful stage

2015-07-28 Thread Alex Nastetsky
Is it possible to restart the job from the last successful stage instead of from the beginning? For example, if your job has stages 0, 1 and 2 .. and stage 0 takes a long time and is successful, but the job fails on stage 1, it would be useful to be able to restart from the output of stage 0
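
As the reply above notes, the usual workaround (a sketch with hypothetical names: input, expensiveTransform, checkpointPath) is to persist or materialize the expensive intermediate result so later stages, or a rerun, can start from it rather than from scratch.

    import org.apache.spark.storage.StorageLevel

    val stage0 = input.map(expensiveTransform)   // hypothetical costly step

    // Within one application: keep the result around for retries of later stages.
    stage0.persist(StorageLevel.DISK_ONLY)

    // Across application restarts: write it out and start the rerun from here,
    // e.g. val stage0 = sc.objectFile[MyRecord](checkpointPath)
    stage0.saveAsObjectFile(checkpointPath)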

Calling rdd() on a DataFrame causes stage boundary

2015-06-22 Thread Alex Nastetsky
When I call rdd() on a DataFrame, it ends the current stage and starts a new one that just maps the DataFrame to rdd and nothing else. It doesn't seem to do a shuffle (which is good and expected), but then why is there a separate stage? I also thought that stages only end when there's a

different schemas per row with DataFrames

2015-06-18 Thread Alex Nastetsky
I am reading JSON data that has different schemas for every record. That is, for a given field that would have a null value, it's simply absent from that record (and therefore, its schema). I would like to use the DataFrame API to select specific fields from this data, and for fields that are
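
A sketch (field names assumed, jsonRDD being an RDD[String]) of one way to handle this: declare the superset schema explicitly, so a field absent from a given record comes back as null instead of breaking the select.

    import org.apache.spark.sql.types._

    val fullSchema = StructType(Seq(
      StructField("id", StringType),
      StructField("sometimesMissing", StringType)))

    val df = sqlContext.read.schema(fullSchema).json(jsonRDD)

    // Records without "sometimesMissing" simply yield null for that column.
    df.select("id", "sometimesMissing").show()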