Encoding issue reading text file

2018-10-18 Thread Masf
Hi everyone, I'm trying to read a text file with UTF-16LE but I'm getting weird characters like this: �� W h e n. My code is this one: sparkSession.read.format("text").option("charset", "UTF-16LE").load("textfile.txt"). I'm using Spark 2.3.1. Any idea to fix
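A possible workaround, sketched on the assumption that Spark 2.3's text source does not honor the charset option while the CSV reader does accept an encoding option (the separator and multiLine settings below are assumptions made to keep each line in a single column):

  // Hedged sketch: read the UTF-16LE file through the CSV reader instead.
  // \u0001 is assumed not to occur in the data, so each line stays whole;
  // multiLine makes the parser decode the file with the given encoding.
  val df = sparkSession.read
    .option("encoding", "UTF-16LE")
    .option("sep", "\u0001")
    .option("multiLine", "true")
    .csv("textfile.txt")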

Dataset error with Encoder

2018-05-12 Thread Masf
Hi, I have the following issue: case class Item (c1: String, c2: String, c3: Option[BigDecimal]) import sparkSession.implicits._ val result = df.as[Item].groupByKey(_.c1).mapGroups((key, value) => { value }) But I get the following error at compile time: Unable to find encoder for type
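The likely cause is that the mapGroups function returns the Iterator itself, and Spark has no encoder for Iterator. A minimal sketch of one fix, assuming the intent was to emit the grouped items, uses flatMapGroups:

  // Sketch: flatMapGroups emits the Items directly, so only the Item
  // encoder already provided by sparkSession.implicits._ is required
  val result = df.as[Item]
    .groupByKey(_.c1)
    .flatMapGroups((key, items) => items)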

Hbase and Spark

2017-01-29 Thread Masf
I'm trying to build an application where it is necessary to do bulkGet and bulkLoad operations on HBase. I think that I could use this component: https://github.com/hortonworks-spark/shc Is it a good option? But I can't import it in my project: sbt cannot resolve the HBase connector. This is my build.sbt:
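For reference, a build.sbt sketch with the resolver that SHC artifacts are typically published under; the version string is an assumption and should be checked against the repository:

  // Hedged sketch: verify the exact shc-core version in the Hortonworks repo
  resolvers += "Hortonworks Repository" at "https://repo.hortonworks.com/content/groups/public/"
  libraryDependencies += "com.hortonworks" % "shc-core" % "1.1.1-2.1-s_2.11"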

Testing with spark testing base

2015-12-05 Thread Masf
Hi. I'm testing "spark testing base". For example: class MyFirstTest extends FunSuite with SharedSparkContext { def tokenize(f: RDD[String]) = { f.map(_.split(" ").toList) } test("really simple transformation") { val input = List("hi", "hi miguel", "bye") val expected =
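A minimal complete sketch of such a test, assuming the expected value follows from tokenize splitting on spaces:

  import com.holdenkarau.spark.testing.SharedSparkContext
  import org.apache.spark.rdd.RDD
  import org.scalatest.FunSuite

  class MyFirstTest extends FunSuite with SharedSparkContext {
    def tokenize(f: RDD[String]) = f.map(_.split(" ").toList)

    test("really simple transformation") {
      val input = List("hi", "hi miguel", "bye")
      val expected = List(List("hi"), List("hi", "miguel"), List("bye"))
      // sc is provided by SharedSparkContext
      assert(tokenize(sc.parallelize(input)).collect().toList === expected)
    }
  }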

Re: Debug Spark

2015-12-02 Thread Masf
t you started >> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ >> >> Thanks >> Best Regards >> >> On Sun, Nov 29, 2015 at 9:48 PM, Masf <masfwo...@gmail.com> wrote: >> >>> Hi >>

Debug Spark

2015-11-29 Thread Masf
Hi. Is it possible to debug Spark locally with IntelliJ or another IDE? Thanks -- Regards. Miguel Ángel
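One common approach, sketched for the Spark 1.x API: run the driver with a local master so breakpoints are hit inside a single JVM:

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch: local[*] keeps driver and executors in one process,
  // so IDE breakpoints work in both driver and task code
  val conf = new SparkConf().setAppName("debug-session").setMaster("local[*]")
  val sc = new SparkContext(conf)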

Re: Debug Spark

2015-11-29 Thread Masf
Hi Ardo. Is there some tutorial on debugging with IntelliJ? Thanks. Regards, Miguel. On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com> wrote: > hi, > > IntelliJ is just great for that! > > cheers, > Ardo. > > On Sun, Nov 29, 2015 at 5:18 PM, Masf <ma

Re: SQLContext load. Filtering files

2015-08-27 Thread Masf
function http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext as the second argument. Thanks Best Regards On Wed, Aug 19, 2015 at 10:46 PM, Masf masfwo...@gmail.com wrote: Hi. I'd like to read Avro files using this library https

Spark 1.3. Insert into hive parquet partitioned table from DataFrame

2015-08-20 Thread Masf
Hi. I have a DataFrame and I want to insert its data into a partitioned Parquet table in Hive. In Spark 1.4 I can use df.write.partitionBy("x", "y").format("parquet").mode("append").saveAsTable("tbl_parquet"), but in Spark 1.3 I can't. How can I do it? Thanks -- Regards Miguel
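A sketch of the usual Spark 1.3 workaround, with hiveContext, tbl_parquet, x and y as placeholders: register the DataFrame as a temporary table and let Hive's dynamic partitioning handle the insert:

  // Hedged sketch for Spark 1.3 with HiveContext
  hiveContext.sql("SET hive.exec.dynamic.partition = true")
  hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
  df.registerTempTable("df_tmp")
  hiveContext.sql("INSERT INTO TABLE tbl_parquet PARTITION (x, y) SELECT * FROM df_tmp")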

SQLContext load. Filtering files

2015-08-19 Thread Masf
Hi. I'd like to read Avro files using this library: https://github.com/databricks/spark-avro I need to load several files from a folder, not all of them. Is there some functionality to filter the files to load? And... is it possible to know the names of the files loaded from a folder? My problem
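One sketch, assuming the file names allow it: Hadoop glob patterns in the load path select a subset of the folder's files (the pattern below is a placeholder):

  // Hedged sketch: globs filter which Avro files are loaded
  val df = sqlContext.load("hdfs:///data/folder/part-2015-08-*.avro",
                           "com.databricks.spark.avro")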

Dataframe Partitioning

2015-05-28 Thread Masf
Hi. I have 2 DataFrames with 1 and 12 partitions respectively. When I do an inner join between these DataFrames, the result contains 200 partitions. Why? df1.join(df2, df1("id") === df2("id"), "inner") returns 200 partitions. Thanks!!! -- Regards. Miguel Ángel
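The 200 comes from spark.sql.shuffle.partitions, the default number of post-shuffle partitions used by joins and aggregations; a sketch of tuning it:

  // Sketch: joins shuffle into spark.sql.shuffle.partitions partitions
  // (default 200) regardless of the inputs' partition counts
  sqlContext.setConf("spark.sql.shuffle.partitions", "12")
  val joined = df1.join(df2, df1("id") === df2("id"), "inner")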

Re: Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I think that it's possible to do: df.select($"*", lit(null).as("col17"), lit(null).as("col18"), lit(null).as("col19"), ..., lit(null).as("col26")) Any other advice? Miguel. On Wed, May 27, 2015 at 5:02 PM, Masf masfwo...@gmail.com wrote: Hi. I have a DataFrame with 16 columns (df1

Adding columns to DataFrame

2015-05-27 Thread Masf
Hi. I have a DataFrame with 16 columns (df1) and another with 26 columns (df2). I want to do a unionAll, so I want to add 10 columns to df1 in order to have the same number of columns in both DataFrames. Is there some alternative to withColumn? Thanks -- Regards. Miguel Ángel
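A sketch of doing this in a single select rather than ten withColumn calls; the col17..col26 names and the string type are assumptions:

  import org.apache.spark.sql.functions.lit

  // Hedged sketch: pad df1 with ten typed null columns, then union by position
  val padding = (17 to 26).map(i => lit(null).cast("string").as(s"col$i"))
  val df1Padded = df1.select(df1.columns.map(df1(_)) ++ padding: _*)
  val unioned = df1Padded.unionAll(df2)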

Re: DataFrame. Conditional aggregation

2015-05-27 Thread Masf
endrscp < 100 then 1 else 0 end test from j' Let me know if this works. On 26 May 2015 23:47, Masf masfwo...@gmail.com wrote: Hi, I don't know how it works. For example: val result = joinedData.groupBy("col1", "col2").agg( count(lit(1)).as("counter"), min("col3").as("minimum"), sum(case when endrscp < 100

Re: DataFrame. Conditional aggregation

2015-05-26 Thread Masf
guha.a...@gmail.com wrote: Case when col2 < 100 then 1 else col2 end On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote: Hi. In a DataFrame, how can I execute a conditional expression in an aggregation? For example, can I translate this SQL statement to DataFrame?: SELECT name, SUM

DataFrame. Conditional aggregation

2015-05-25 Thread Masf
Hi. In a DataFrame, how can I execute a conditional expression in an aggregation? For example, can I translate this SQL statement to DataFrame?: SELECT name, SUM(IF table.col2 < 100 THEN 1 ELSE table.col1) FROM table GROUP BY name Thanks -- Regards. Miguel
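For reference, a DataFrame sketch using when/otherwise from the functions object (available from Spark 1.4; the comparison col2 < 100 is assumed from the garbled original):

  import org.apache.spark.sql.functions.{col, sum, when}

  // Hedged sketch of SELECT name, SUM(IF col2 < 100 THEN 1 ELSE col1) GROUP BY name
  val result = df.groupBy("name")
    .agg(sum(when(col("col2") < 100, 1).otherwise(col("col1"))).as("total"))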

Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric. Q1: When I read Parquet files, I've observed that Spark generates as many partitions as there are Parquet files in the path. Q2: To reduce the number of partitions you can use rdd.repartition(x), x = number of partitions. Depending on your case, repartition can be a heavy task. Regards.

Inserting Nulls

2015-05-05 Thread Masf
Hi. I have a Spark application where I store the results into a table (with HiveContext). Some of these columns allow nulls. In Scala, these columns are represented as Option[Int] or Option[Double], depending on the data type. For example: val hc = new HiveContext(sc) var col1:
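A sketch of how this typically works, with field and table names as placeholders: Option fields in a case class map to nullable columns, so None becomes NULL on insert:

  case class Result(col1: Option[Int], col2: Option[Double])

  // Hedged sketch: None maps to NULL in the nullable columns
  val rdd = sc.parallelize(Seq(Result(Some(1), None), Result(None, Some(2.5))))
  val df = hc.createDataFrame(rdd)
  df.registerTempTable("results_tmp")
  hc.sql("INSERT INTO TABLE results_table SELECT * FROM results_tmp")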

Re: Opening many Parquet files = slow

2015-04-15 Thread Masf
Hi guys. Regarding Parquet files: I have Spark 1.2.0 and reading 27 Parquet files (250MB/file) takes 4 minutes. I have a cluster with 4 nodes and it seems too slow to me. The load function is not available in Spark 1.2, so I can't test it. Regards. Miguel. On Mon, Apr 13, 2015 at 8:12 PM,

Re: Increase partitions reading Parquet File

2015-04-14 Thread Masf
)? --- Original Message --- From: Masf masfwo...@gmail.com Sent: April 9, 2015 11:45 PM To: user@spark.apache.org Subject: Increase partitions reading Parquet File Hi. I have this statement: val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet") This code
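The usual answer, sketched with a placeholder partition count: repartition explicitly after the read, since parquetFile yields roughly one partition per file/block:

  // Hedged sketch: spread the data over more partitions after loading
  val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
  val wider = file.repartition(100)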

Re: Error reading smallin in hive table with parquet format

2015-04-02 Thread Masf
1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote: Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1 field1 is SMALLINT. If table1 is in text format everything is OK, but if table1

Spark SQL. Memory consumption

2015-04-02 Thread Masf
Hi. I'm using Spark SQL 1.2. I have this query: CREATE TABLE test_MA STORED AS PARQUET AS SELECT field1 ,field2 ,field3 ,field4 ,field5 ,COUNT(1) AS field6 ,MAX(field7) ,MIN(field8) ,SUM(field9 / 100) ,COUNT(field10) ,SUM(IF(field11 < -500, 1, 0)) ,MAX(field12) ,SUM(IF(field13 = 1, 1, 0))

Error reading smallin in hive table with parquet format

2015-04-01 Thread Masf
Hi. In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement: CREATE TABLE testTable STORED AS PARQUET AS SELECT field1 FROM table1 field1 is SMALLINT. If table1 is in text format everything is OK, but if table1 is in Parquet format, Spark returns the following error:

Re: Error in Delete Table

2015-03-31 Thread Masf
Hi Ted. Spark 1.2.0 and Hive 0.13.1. Regards. Miguel Angel. On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu yuzhih...@gmail.com wrote: Which Spark and Hive release are you using? Thanks On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote: Hi. In HiveContext, when I put

Re: Too many open files

2015-03-30 Thread Masf
values: Have you done the above modification on all the machines in your Spark cluster ? If you use Ubuntu, be sure that the /etc/pam.d/common-session file contains the following line: session required pam_limits.so On Mon, Mar 30, 2015 at 5:08 AM, Masf masfwo...@gmail.com wrote: Hi

Too many open files

2015-03-30 Thread Masf
Hi. I have a problem with temp data in Spark. I have set spark.shuffle.manager to SORT. In /etc/security/limits.conf I set the next values: * soft nofile 100 * hard nofile 100 In spark-env.sh I set ulimit -n 100. I've restarted the Spark service and it

Re: Too many open files

2015-03-30 Thread Masf
the machines to get the ulimit effect (or relogin). What operation are you doing? Are you doing too many repartitions? Thanks Best Regards On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote: Hi. I have a problem with temp data in Spark. I have set spark.shuffle.manager

Error in Delete Table

2015-03-27 Thread Masf
Hi. In HiveContext, when I run the statement DROP TABLE IF EXISTS TestTable and TestTable doesn't exist, Spark returns an error: ERROR Hive: NoSuchObjectException(message:default.TestTable table not found) at

Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
of window function support in 1.4.0. But it's not a promise yet. Cheng On 3/26/15 7:27 PM, Arush Kharbanda wrote: It's not yet implemented. https://issues.apache.org/jira/browse/SPARK-1442 On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote: Hi. Are the windowing

Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Hi. Are the windowing and analytics functions supported in Spark SQL (with HiveContext or not)? For example, in Hive they are supported: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Is there some tutorial or documentation where I can see all the features supported by Spark

Re: Issues with SBT and Spark

2015-03-19 Thread Masf
Hi. Spark 1.2.1 uses Scala 2.10. Because of this, your program fails with Scala 2.11. Regards On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan kvi...@vt.edu wrote: My current simple.sbt is name := "SparkEpiFast" version := "1.0" scalaVersion := "2.11.4" libraryDependencies +=
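A sketch of a matching pair of settings (exact patch versions are placeholders):

  // build.sbt sketch: Spark 1.2.1 artifacts are built for Scala 2.10,
  // so the project's scalaVersion must be 2.10.x
  name := "SparkEpiFast"
  version := "1.0"
  scalaVersion := "2.10.4"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided"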

Hive error on partitioned tables

2015-03-17 Thread Masf
Hi. I'm running Spark 1.2.0. I have a HiveContext and I execute the following query: select sum(field1 / 100) from table1 group by field2; field1 in the Hive metastore is a smallint, but the schema detected by HiveContext is an int32: fileSchema: message schema { optional int32 field1;
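One hedged workaround sketch, assuming the failure is the smallint/int32 mismatch: cast the column explicitly before aggregating:

  // Sketch: an explicit cast sidesteps the smallint/int32 schema mismatch
  hiveContext.sql("select sum(cast(field1 as int) / 100) from table1 group by field2")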

Re: Spark SQL. Cast to Bigint

2015-03-17 Thread Masf
HiveContext for now? On Fri, Mar 13, 2015 at 4:48 AM, Masf masfwo...@gmail.com wrote: Hi. I have a query in Spark SQL and I cannot convert a value to BIGINT: CAST(column AS BIGINT) or CAST(0 AS BIGINT) The output is: Exception in thread main java.lang.RuntimeException: [34.62] failure

Parquet and repartition

2015-03-16 Thread Masf
Hi all. When I specify the number of partitions and save this RDD in Parquet format, my app fails. For example: selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput") However, it works well if I store the data as text: selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput") My

Re: Parquet and repartition

2015-03-16 Thread Masf
fail means here. On Mon, Mar 16, 2015 at 11:11 AM, Masf masfwo...@gmail.com wrote: Hi all. When I specify the number of partitions and save this RDD in Parquet format, my app fails. For example: selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput") However, it works well

Spark SQL. Cast to Bigint

2015-03-13 Thread Masf
Hi. I have a query in Spark SQL and I cannot convert a value to BIGINT: CAST(column AS BIGINT) or CAST(0 AS BIGINT) The output is: Exception in thread main java.lang.RuntimeException: [34.62] failure: ``DECIMAL'' expected but identifier BIGINT found Thanks!! Regards. Miguel Ángel
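For reference, a sketch of the workaround suggested in the reply: the HiveContext parser of that era accepts BIGINT where the plain SQLContext parser does not (column1 and some_table are placeholders):

  import org.apache.spark.sql.hive.HiveContext

  // Hedged sketch: the HiveQL parser understands CAST(... AS BIGINT)
  val hc = new HiveContext(sc)
  val df = hc.sql("SELECT CAST(column1 AS BIGINT) FROM some_table")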

Re: Read parquet folders recursively

2015-03-12 Thread Masf
Thanks Best Regards On Wed, Mar 11, 2015 at 9:45 PM, Masf masfwo...@gmail.com wrote: Hi all. Is it possible to read folders recursively to load Parquet files? Thanks. -- Regards. Miguel Ángel -- Regards. Miguel Ángel

Read parquet folders recursively

2015-03-11 Thread Masf
Hi all. Is it possible to read folders recursively to load Parquet files? Thanks. -- Regards. Miguel Ángel
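One common sketch, assuming the directory depth is known in advance: Hadoop glob patterns expand each level of the tree (the path is a placeholder):

  // Hedged sketch: one glob per directory level; the depth must be fixed
  val df = sqlContext.parquetFile("hdfs:///warehouse/root/*/*/part-*.parquet")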