Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Thanks for your response. What do you mean when you said "immediately return"? On Wed, Mar 28, 2018, 10:33 PM Jörn Franke wrote: > I don’t think select * is a good benchmark. You should do a more complex > operation, otherwise optimizers might see that you don’t do

Re: [SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Jörn Franke
I don’t think select * is a good benchmark. You should do a more complex operation, otherwise optimizers might see that you don’t do anything in the query and return immediately (similarly, count might return immediately by using some statistics). > On 29. Mar 2018, at 02:03, Tin Vu
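For what it's worth, a minimal sketch of the kind of query that forces real work, assuming the standard TPCDS store_sales table is registered under that name; an aggregation like this cannot be answered from statistics alone:

    // Hypothetical benchmark query: the GROUP BY forces a real scan and
    // shuffle, unlike SELECT * or COUNT(*), which the optimizer may
    // short-circuit using metadata.
    spark.sql("""
      SELECT ss_store_sk, SUM(ss_net_paid) AS total_paid
      FROM store_sales
      GROUP BY ss_store_sk
      ORDER BY total_paid DESC
    """).collect()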

Unable to get results of intermediate dataset

2018-03-28 Thread Sunitha Chennareddy
Hi Team, I am new to Spark. My requirement: I have a huge list, which is converted to a Spark Dataset. I need to operate on this Dataset, store the computed values in another object/Dataset, and keep it in memory for further processing. The approach I tried: the list is retrieved from a third party in
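A minimal sketch of that flow, with hypothetical Input/Computed case classes and a placeholder fetch standing in for the third-party call:

    import org.apache.spark.sql.SparkSession

    // Hypothetical record types standing in for the real objects
    case class Input(id: Long, value: String)
    case class Computed(id: Long, result: Int)

    val spark = SparkSession.builder.appName("sketch").getOrCreate()
    import spark.implicits._

    val hugeList: Seq[Input] = fetchFromThirdParty()   // placeholder for the third-party call
    val ds = hugeList.toDS()                           // list -> Dataset
    val computed = ds.map(in => Computed(in.id, in.value.length))
    computed.cache()                                   // keep the derived Dataset in memory for further steps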

unsubscribe

2018-03-28 Thread purna pradeep
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{DataType, StructType}

val test_schema = DataType.fromJson(schema).asInstanceOf[StructType]
val session = SparkHelper.getSparkSession
val df1: DataFrame = session.read
  .format("json")
  .schema(test_schema)
  .option("inferSchema", "false")
  .option("mode", "FAILFAST")
  .load("src/test/resources/*.gz")
df1.show(80)

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
I've had more success exporting the schema toJson and importing that. Something like:

    val df1: DataFrame = session.read
      .format("json")
      .schema(test_schema)
      .option("inferSchema", "false")
      .option("mode", "FAILFAST")
      .load("src/test/resources/*.gz")
    df1.show(80)

On Wed, Mar 28, 2018
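For reference, a minimal sketch of the full round trip this thread converges on, using the schema's JSON form rather than toString or catalogString:

    import org.apache.spark.sql.types.{DataType, StructType}

    // Export: the JSON form of a schema is machine-readable
    val schemaJson: String = df1.schema.json

    // Import: rebuild the StructType from the JSON string
    val imported = DataType.fromJson(schemaJson).asInstanceOf[StructType]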

[SparkSQL] SparkSQL performance on small TPCDS tables is very low when compared to Drill or Presto

2018-03-28 Thread Tin Vu
Hi, I am executing a benchmark to compare the performance of SparkSQL, Apache Drill and Presto. My experimental setup:
- TPCDS dataset with scale factor 100 (size 100GB).
- Spark, Drill and Presto have the same number of workers: 12.
- Each worker has the same allocated amount of memory: 4GB.
-

Unsubscribe

2018-03-28 Thread purna pradeep
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
The toString representation looks like the following, where "someName" is unique: StructType(StructField("someName",StringType,true), StructField("someName",StructType(StructField("someName",StructType(StructField("someName",StringType,true), StructField("someName",StringType,true)),true),

Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi: I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events. The processing is in two steps: 1. flatMapGroupsWithState (based on user id), which stores the state of the user and emits events every minute until an expire event is received. 2. The next step
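A minimal sketch of step 1, assuming hypothetical Event/SessionState/SessionUpdate types and an events Dataset with an eventTime column; the event-time timeout here stands in for the expire handling described above:

    import java.sql.Timestamp
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    // Hypothetical types for illustration
    case class Event(userId: String, eventTime: Timestamp, action: String)
    case class SessionState(actions: List[String])
    case class SessionUpdate(userId: String, actionCount: Int, expired: Boolean)

    def updateSession(
        userId: String,
        events: Iterator[Event],
        state: GroupState[SessionState]): Iterator[SessionUpdate] = {
      if (state.hasTimedOut) {
        val finalUpdate = SessionUpdate(userId, state.get.actions.size, expired = true)
        state.remove()                                   // drop the expired session
        Iterator(finalUpdate)
      } else {
        val current = state.getOption.getOrElse(SessionState(Nil))
        val updated = SessionState(current.actions ++ events.map(_.action))
        state.update(updated)
        // time out one minute past the current watermark
        state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 60 * 1000)
        Iterator(SessionUpdate(userId, updated.actions.size, expired = false))
      }
    }

    val sessions = events
      .withWatermark("eventTime", "10 minutes")
      .groupByKey(_.userId)
      .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.EventTimeTimeout)(updateSession)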

spark-sql importing schemas from catalogString or schema.toString()

2018-03-28 Thread Colin Williams
I've been learning spark-sql and have been trying to export and import some of the generated schemas in order to edit them. I've been writing the schemas to strings like df1.schema.toString() and df.schema.catalogString, but I've been having trouble loading the schemas created. Does anyone know if it's

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
I suppose it's hardly possible that this issue is connected with the string encoding, because:
- "pr^?files.10056.10040" should be "profiles.10056.10040" and is defined as a constant in the source code
-

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Jörn Franke
An encoding issue in the data? E.g., Spark uses UTF-8, but the source encoding is different? > On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky wrote: > > Hello guys, > > I'm using Spark 2.2.0 and from time to time my job fails printing into > the log the following errors > >

DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
Hello guys, I'm using Spark 2.2.0 and from time to time my job fails, printing the following errors into the log:

    scala.MatchError: profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@
    scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)

Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi: I am using Spark Structured Streaming 2.2.1 with the flatMapGroupsWithState and groupBy count operators. In the StreamExecution logs I see two entries for stateOperators:

    "stateOperators" : [
      { "numRowsTotal" : 1617339, "numRowsUpdated" : 9647 },
      { "numRowsTotal" :

Re: SparkStreaming job breaks with shuffle file not found

2018-03-28 Thread Lucas Kacher
I have been running into this as well, but I am using S3 for checkpointing, so I chalked it up to network partitioning with s3-isnt-hdfs as my storage location. But it seems that you are indeed using HDFS, so I wonder if there is another underlying issue. On Wed, Mar 28, 2018 at 8:21 AM, Jone

SparkStreaming job breaks with shuffle file not found

2018-03-28 Thread Jone Zhang
The Spark Streaming job had been running for a few days, then failed as below. What is the possible reason? 18/03/25 07:58:37 ERROR yarn.ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 80018.0 failed 4 times, most recent

Re: java.lang.UnsupportedOperationException: CSV data source does not support struct/ERROR RetryingBlockFetcher

2018-03-28 Thread Jiří Syrový
Quick comment: Excel CSV (a very special case, though) supports arrays in CSV using "\n" inside quotes, but you then have to use "\r\n" (the Windows EOL) as the EOL for the row. Cheers, Jiri 2018-03-28 14:14 GMT+02:00 Yong Zhang : > Your dataframe has array data type, which is NOT
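A minimal sketch of reading such a file back in Spark, assuming Spark 2.2+ where the CSV reader's multiLine option exists (the path is hypothetical):

    // multiLine lets quoted fields contain embedded newlines, so a record
    // may span several physical lines.
    val df = spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .csv("/tmp/excel-export.csv")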

Re: java.lang.UnsupportedOperationException: CSV data source does not support struct/ERROR RetryingBlockFetcher

2018-03-28 Thread Yong Zhang
Your dataframe has an array data type, which is NOT supported by CSV. How could a CSV file include an array or other nested structure? If you want your data to be human-readable text, then write it out as JSON in your case. Yong From: Mina Aslani
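A minimal sketch of that suggestion, plus a hedged alternative when CSV output is a hard requirement; "features" is a hypothetical column name, and to_json support for array columns varies by Spark version:

    import org.apache.spark.sql.functions.{col, to_json}

    // JSON output handles arrays and nested structs natively
    df.write.json("/tmp/out-json")

    // If CSV is required, serialize the nested column to a JSON string first
    df.withColumn("features", to_json(col("features")))
      .write.option("header", "true").csv("/tmp/out-csv")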

Testing with spark-base-test

2018-03-28 Thread Guillermo Ortiz
I'm using spark-unit-test and I can't get the code to compile.

    test("Testing") {
      val inputInsert = A("data2")
      val inputDelete = A("data1")
      val outputInsert = B(1)
      val outputDelete = C(1)
      val input = List(List(inputInsert), List(inputDelete))
      val output =
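If this is Holden Karau's spark-testing-base (an assumption; the post says spark-unit-test), a minimal DStream test has this shape, with A/B and the counting operation as hypothetical stand-ins:

    import com.holdenkarau.spark.testing.StreamingSuiteBase
    import org.apache.spark.streaming.dstream.DStream
    import org.scalatest.FunSuite

    // Hypothetical input/output types
    case class A(data: String)
    case class B(count: Int)

    class SampleStreamingTest extends FunSuite with StreamingSuiteBase {
      test("count per batch") {
        val input = List(List(A("data2")), List(A("data1")))
        val expected = List(List(B(1)), List(B(1)))
        // each batch of A records is reduced to a single B count
        def countBatch(ds: DStream[A]): DStream[B] = ds.count().map(c => B(c.toInt))
        testOperation(input, countBatch _, expected, ordered = true)
      }
    }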

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-28 Thread Gourav Sengupta
Hi Michael, I think that is what I am trying to show here as the documentation mentions "NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the cluster manager." So, in a way I am supporting your

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-28 Thread Michael Shtelma
Hi, this property will be used in YARN mode only by the driver. Executors will use the properties coming from YARN for storing temporary files. Best, Michael On Wed, Mar 28, 2018 at 7:37 AM, Gourav Sengupta wrote: > Hi, > > > As per documentation in:
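To make that concrete, a sketch of where the property can still be set; the path is hypothetical, and per the reply above the setting only matters for the driver when running on YARN:

    import org.apache.spark.sql.SparkSession

    // On YARN, executors take their scratch directories from the cluster
    // manager (LOCAL_DIRS); this setting is honored by the driver and in
    // local/standalone modes.
    val spark = SparkSession.builder
      .config("spark.local.dir", "/mnt/large-disk/spark-tmp")
      .getOrCreate()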

Re: [Spark Java] Add new column in DataSet based on existing column

2018-03-28 Thread Divya Gehlot
Hi, here is an example snippet in Scala:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, to_date}

    // Convert to a Date type
    val timestamp2datetype: (Column) => Column = (x) => { to_date(x) }
    df = df.withColumn("date", timestamp2datetype(col("end_date")))

Hope this helps! Thanks, Divya On 28 March 2018 at 15:16, Junfeng Chen
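Extending that snippet to the original question (year, month, day and hour as new columns), a sketch assuming Spark 2.2+ and that the ISO column is named "timestamp":

    import org.apache.spark.sql.functions.{col, to_timestamp, year, month, dayofmonth, hour}

    // Parse the ISO string once, then derive the date parts from it
    val withParts = df
      .withColumn("ts", to_timestamp(col("timestamp")))
      .withColumn("year", year(col("ts")))
      .withColumn("month", month(col("ts")))
      .withColumn("day", dayofmonth(col("ts")))
      .withColumn("hour", hour(col("ts")))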

[Spark Java] Add new column in DataSet based on existing column

2018-03-28 Thread Junfeng Chen
I am working on adding a date-transformed field to an existing dataset. The current dataset contains a column named timestamp in ISO format. I want to parse this field to a Joda time type, and then extract the year, month, day and hour info as new columns attached to the original dataset. I have tried