Apache Spark 2.1.0 question in Spark SQL

2018-04-03 Thread anbu
Please help me with the below error and suggest a different approach to the below data manipulation. Error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc.) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other
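
This error usually means there is no implicit Encoder in scope for the Dataset's element type. A minimal sketch of the common fix, assuming a hypothetical Record case class (defined at the top level, not inside the method that uses it):

    import org.apache.spark.sql.SparkSession

    // Product types get an Encoder derived automatically once
    // spark.implicits._ is imported; the fields here are hypothetical.
    case class Record(id: Int, name: String)

    object EncoderExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("encoder-example").getOrCreate()
        import spark.implicits._  // Encoders for primitives and case classes

        val ds = Seq(Record(1, "a"), Record(2, "b")).toDS()
        ds.show()
        spark.stop()
      }
    }

Types not covered by these implicits (e.g. arbitrary Java classes) need an explicit Encoder such as Encoders.kryo.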

Re: How to delete empty columns in df when writing to parquet?

2018-04-03 Thread Junfeng Chen
You mean I should start two Spark Streaming applications and read the topics respectively? Regards, Junfeng Chen On Tue, Apr 3, 2018 at 10:31 PM, naresh Goud wrote: > I don’t see any option other than starting two individual queries. It’s > just a thought. > > Thank you,
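
For reference, a minimal sketch of the "two individual queries" suggestion, assuming Structured Streaming with the Kafka source (topic names, paths and the bootstrap server are hypothetical); both queries can run inside one application:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("two-queries").getOrCreate()

    def topicStream(topic: String) = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", topic)
      .load()

    // One independent sink per topic, each with its own checkpoint.
    val q1 = topicStream("topicA").writeStream
      .format("parquet")
      .option("path", "/data/out/topicA")
      .option("checkpointLocation", "/data/chk/topicA")
      .start()

    val q2 = topicStream("topicB").writeStream
      .format("parquet")
      .option("path", "/data/out/topicB")
      .option("checkpointLocation", "/data/chk/topicB")
      .start()

    spark.streams.awaitAnyTermination()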

unsubscribe

2018-04-03 Thread 15811225244
unsubscribe

Testing spark-testing-base. Error multiple SparkContext

2018-04-03 Thread Guillermo Ortiz
I'm doing a Spark test with Spark Streaming, Cassandra and Kafka. I have an action which takes a DStream as input, saves to Cassandra and sometimes puts some elements in Kafka. I'm using https://github.com/holdenk/spark-testing-base with Kafka and Cassandra running locally. My method looks like: *def
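
For anyone hitting the same "multiple SparkContext" error: spark-testing-base already creates a context for the test, so the usual cause is creating a second one inside the test itself. A minimal sketch, assuming the library's SharedSparkContext trait (the test body is hypothetical):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class MyJobTest extends FunSuite with SharedSparkContext {
      test("uses the shared context") {
        // Do not call `new SparkContext(...)` here; reuse the provided `sc`.
        val rdd = sc.parallelize(Seq(1, 2, 3))
        assert(rdd.count() === 3)
      }
    }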

bucketing in SPARK

2018-04-03 Thread Gourav Sengupta
Hi, I am going through the presentation https://databricks.com/session/hive-bucketing-in-apache-spark. Do we need to bucket both the tables for this to work? And is it mandatory that the numbers of buckets should be multiples of each other? Also if I export a persistent table to S3 will this
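
For reference, a minimal sketch of how a bucketed persistent table is written with the DataFrame API (table and column names are hypothetical); whether both sides must be bucketed, and with compatible bucket counts, is exactly the open question above:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("bucketing")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    val events = Seq((1, "click"), (2, "view")).toDF("user_id", "event")

    // Note: bucketBy only works with saveAsTable, not save(path).
    events.write
      .bucketBy(8, "user_id")
      .sortBy("user_id")
      .saveAsTable("events_bucketed")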

Re: Issue with using Generalized Linear Regression for Logistic Regression modeling

2018-04-03 Thread FireFly
It turns out that the weight was too large (with mean around 5000 and the standard deviation around 8000) and caused overflow. After scaling down the weight to, for example, numbers between 0 and 1, the code converged nicely. Spark did not report the overflow issue. We actually found it out by
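
A minimal sketch of that workaround, assuming a DataFrame df with a Double "weight" column (the min-max rescaling and column names are illustrative, not the poster's exact code):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.{col, max, min}
    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    // Rescale weights from [wMin, wMax] down to [0, 1].
    val Row(wMin: Double, wMax: Double) =
      df.select(min("weight"), max("weight")).head
    val scaled = df.withColumn(
      "weight_scaled", (col("weight") - wMin) / (wMax - wMin))

    val glr = new GeneralizedLinearRegression()
      .setFamily("binomial")
      .setLink("logit")
      .setWeightCol("weight_scaled")
    // glr.fit(scaled) assumes the usual "features"/"label" columns exist.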

Re: ORC native in Spark 2.3, with zlib, gives java.nio.BufferUnderflowException during read

2018-04-03 Thread Eirik Thorsnes
On 28 March 2018 03:26, Dongjoon Hyun wrote: > You may hit SPARK-23355 (convertMetastore should not ignore table properties). > > Since it's a known Spark issue for all Hive tables (Parquet/ORC), could you > check that too? > > Bests, > Dongjoon. > Hi, I think you might be right, I can run
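
If SPARK-23355 is indeed the cause, one workaround to test (an assumption, not a confirmed fix for this case) is to disable the metastore conversion, so the Hive SerDe path, which honors the table properties, is used instead of the native reader:

    // Runtime setting:
    spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
    // Or at launch:
    //   spark-shell --conf spark.sql.hive.convertMetastoreOrc=false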

Re: How to pass sparkSession from driver to executor

2018-04-03 Thread Gourav Sengupta
Hi, the other thing that you may try doing is use the following in your SQL and then filter out records with regular expressions according to which directory they came from. But I would be very interested to know the details which I have asked for in my earlier email. input_file_name()
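
A minimal sketch of that suggestion, using input_file_name() to tag each record with its source path and then filtering by directory (paths and the pattern are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.input_file_name

    val spark = SparkSession.builder().appName("by-source-dir").getOrCreate()
    import spark.implicits._

    val withSource = spark.read.json("/data/json/*")
      .withColumn("source_file", input_file_name())

    val onlyDirA = withSource.filter($"source_file".rlike("/data/json/dirA/"))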

Re: How to pass sparkSession from driver to executor

2018-04-03 Thread Gourav Sengupta
Hi, I think that what you are facing is documented in the Spark programming guide: http://spark.apache.org/docs/latest/rdd-programming-guide.html#understanding-closures- May I ask what you are trying to achieve here? From what I understand, you have a list of JSON files which you want to read separately, as they
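
For context, a minimal sketch of the closure problem the linked section describes: the SparkSession lives only on the driver, so reads must be issued there rather than inside a map over an RDD/Dataset (file paths are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("json-list").getOrCreate()
    val paths = Seq("/data/a.json", "/data/b.json")

    // Fails: the closure would ship `spark` to executors.
    //   paths.toDS().map(p => spark.read.json(p))

    // Works: loop on the driver; each read is still distributed.
    val dfs = paths.map(p => spark.read.json(p))

    // Or read everything in one distributed pass:
    val all = spark.read.json(paths: _*)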

Re: [Spark sql]: Re-execution of same operation takes less time than 1st

2018-04-03 Thread naresh Goud
Whenever Spark reads data, it will keep it in executor memory unless and until there is no room for new data to be read or processed. This is the beauty of Spark. On Tue, Apr 3, 2018 at 12:42 AM snjv wrote: > Hi, > > When we execute the same operation twice, spark
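
One way to make that reuse explicit, rather than relying on whatever happens to still be in memory, is to cache the data; a minimal sketch (the path is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("cache-example").getOrCreate()

    val df = spark.read.parquet("/data/events")
    df.cache()  // or df.persist(StorageLevel.MEMORY_AND_DISK)

    df.count()  // first action: reads from storage and populates the cache
    df.count()  // second action: served from executor memory, much faster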

Re: How does extending an existing parquet with columns affect impala/spark performance?

2018-04-03 Thread naresh Goud
From Spark's point of view it shouldn't have an effect. It's possible to extend new Parquet files with additional columns, and it won't affect performance or require changes to the Spark application code. On Tue, Apr 3, 2018 at 9:14 AM Vitaliy Pisarev wrote: > This is not strictly a
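
One caveat worth testing when the new files carry extra columns: Spark's Parquet reader needs schema merging enabled to reconcile old and new files in a single read. A minimal sketch (the path is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("merge-schema").getOrCreate()

    val df = spark.read
      .option("mergeSchema", "true")  // off by default; merging has a cost
      .parquet("/data/table")
    // Rows from older files surface the new columns as null.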

How does extending an existing parquet with columns affect impala/spark performance?

2018-04-03 Thread Vitaliy Pisarev
This is not strictly a Spark question but I'll give it a shot: I have an existing setup of Parquet files that are being queried from Impala and from Spark. I intend to add some 30 relatively 'heavy' columns to the Parquet files. Each column would store an array of structs. Each struct can have from 5 to
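
For concreteness, a sketch of the kind of column being described, expressed as a Spark SQL schema (field names are hypothetical):

    import org.apache.spark.sql.types._

    // One 'heavy' column: an array of structs.
    val heavyColumn = StructField(
      "measurements",
      ArrayType(StructType(Seq(
        StructField("name", StringType),
        StructField("value", DoubleType)
      ))))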