Encoding issue reading text file

2018-10-18 Thread Masf
Hi everyone,

I'm trying to read a text file encoded as UTF-16LE, but I'm getting weird
characters like this:
�� W h e n

My code is this one:

sparkSession
  .read
  .format("text")
  .option("charset", "UTF-16LE")
  .load("textfile.txt")

I'm using Spark 2.3.1. Any idea how to fix it?
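
A possible workaround, assuming each file is small enough to decode in one pass (the text source does not seem to honor the charset option in 2.3), is to read the raw bytes and decode them explicitly:

import java.nio.charset.StandardCharsets
import sparkSession.implicits._

// Read each file as raw bytes, decode it as UTF-16LE ourselves, then split into lines
val lines = sparkSession.sparkContext
  .binaryFiles("textfile.txt")
  .flatMap { case (_, stream) =>
    new String(stream.toArray, StandardCharsets.UTF_16LE).split("\r?\n")
  }
  .toDS()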

Thanks
-- 


Regards.
Miguel Ángel


Dataset error with Encoder

2018-05-12 Thread Masf
Hi,

I have the following issue,

case class Item (c1: String, c2: String, c3: Option[BigDecimal])
import sparkSession.implicits._
val result = df.as[Item]
  .groupByKey(_.c1)
  .mapGroups((key, value) => { value })

But I get the following error at compile time:

Unable to find encoder for type stored in a Dataset.  Primitive types
(Int, String, etc) and Product types (case classes) are supported by
importing spark.implicits._  Support for serializing other types will
be added in future releases.


What am I missing?
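
The error seems to come from the return type of the mapGroups function: it returns the group Iterator itself, for which there is no Encoder. A minimal sketch of two alternatives, assuming the Item case class above:

import org.apache.spark.sql.Dataset
import sparkSession.implicits._

// Option 1: keep one output row per input row in each group
val flat: Dataset[Item] =
  df.as[Item].groupByKey(_.c1).flatMapGroups((key, items) => items.toSeq)

// Option 2: materialize each group into a Seq, which does have an encoder
val grouped: Dataset[Seq[Item]] =
  df.as[Item].groupByKey(_.c1).mapGroups((key, items) => items.toSeq)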

Thanks


Hbase and Spark

2017-01-29 Thread Masf
I'm trying to build an application where it is necessary to do bulk gets and
bulk loads on HBase.

I think that I could use this component:
https://github.com/hortonworks-spark/shc
*Is it a good option?*

But *I can't import it into my project*. sbt cannot resolve the HBase
connector.
This is my build.sbt:

version := "1.0"
scalaVersion := "2.10.6"

mainClass in assembly := Some("com.location.userTransaction")

assemblyOption in assembly ~= { _.copy(includeScala = false) }

resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/
"

val sparkVersion = "1.6.0"
val jettyVersion = "8.1.14.v20131031"
val hbaseConnectorVersion = "1.0.0-1.6-s_2.10"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-hive" % sparkVersion % "provided"
)
libraryDependencies += "com.hortonworks" % "shc" % hbaseConnectorVersion
libraryDependencies += "org.eclipse.jetty" % "jetty-client" % jettyVersion

-- 


Saludos.
Miguel Ángel


Testing with spark testing base

2015-12-05 Thread Masf
Hi.

I'm testing "spark testing base". For example:

class MyFirstTest extends FunSuite with SharedSparkContext {
  def tokenize(f: RDD[String]) = {
    f.map(_.split(" ").toList)
  }

  test("really simple transformation") {
    val input = List("hi", "hi miguel", "bye")
    val expected = List(List("hi"), List("hi", "miguel"), List("bye"))
    assert(tokenize(sc.parallelize(input)).collect().toList === expected)
  }
}


But... how can I launch this test?
With spark-submit or from IntelliJ?
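
For what it's worth, a suite like this is normally run with sbt test (or from IntelliJ's ScalaTest runner) rather than spark-submit, since SharedSparkContext starts a local SparkContext for you. A sketch of the test dependencies, with illustrative version numbers:

// build.sbt (version numbers are illustrative)
libraryDependencies ++= Seq(
  "org.scalatest"   %% "scalatest"          % "2.2.4"       % "test",
  "com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.1" % "test"
)

Then sbt test runs every suite, and sbt "testOnly MyFirstTest" runs just this one.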

Thanks.

-- 
Regards
Miguel


Re: Debug Spark

2015-12-02 Thread Masf
This is very interesting.

Thanks!!!

On Thu, Dec 3, 2015 at 8:28 AM, Sudhanshu Janghel <
sudhanshu.jang...@cloudwick.com> wrote:

> Hi,
>
> Here is a doc that I had created for my team. This has steps along with
> snapshots of how to setup debugging in spark using IntelliJ locally.
>
>
> https://docs.google.com/a/cloudwick.com/document/d/13kYPbmK61di0f_XxxJ-wLP5TSZRGMHE6bcTBjzXD0nA/edit?usp=sharing
>
> Kind Regards,
> Sudhanshu
>
> On Thu, Dec 3, 2015 at 6:46 AM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> This doc will get you started
>> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ
>>
>> Thanks
>> Best Regards
>>
>> On Sun, Nov 29, 2015 at 9:48 PM, Masf <masfwo...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>>
>>> Thanks
>>>
>>> --
>>> Regards.
>>> Miguel Ángel
>>>
>>
>>
>


-- 


Saludos.
Miguel Ángel


Debug Spark

2015-11-29 Thread Masf
Hi

Is it possible to debug spark locally with IntelliJ or another IDE?

Thanks

-- 
Regards.
Miguel Ángel


Re: Debug Spark

2015-11-29 Thread Masf
Hi Ardo,

Is there some tutorial on how to debug with IntelliJ?

Thanks

Regards.
Miguel.


On Sun, Nov 29, 2015 at 5:32 PM, Ndjido Ardo BAR <ndj...@gmail.com> wrote:

> hi,
>
> IntelliJ is just great for that!
>
> cheers,
> Ardo.
>
> On Sun, Nov 29, 2015 at 5:18 PM, Masf <masfwo...@gmail.com> wrote:
>
>> Hi
>>
>> Is it possible to debug spark locally with IntelliJ or another IDE?
>>
>> Thanks
>>
>> --
>> Regards.
>> Miguel Ángel
>>
>
>


-- 


Saludos.
Miguel Ángel


Re: SQLContext load. Filtering files

2015-08-27 Thread Masf
Thanks Akhil, I will have a look.

I have a doubt regarding Spark Streaming and fileStream. If Spark Streaming
crashes and new files are created in the input folder while Spark is down,
how can I process those files when Spark Streaming is launched again?

Thanks.
Regards.
Miguel.
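
Regarding the restart question: one option, assuming the fileStream overload that takes a filter and a newFilesOnly flag, is to pass newFilesOnly = false so files already present in the directory are also picked up. A sketch:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newFilesOnly = false also picks up files that already exist in the folder
// when the streaming context starts (path and filter are illustrative)
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///input/folder",
  (path: Path) => !path.getName.startsWith("."),
  newFilesOnly = false
).map(_._2.toString)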



On Thu, Aug 27, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Have a look at the spark streaming. You can make use of the ssc.fileStream.

 Eg:

 val avroStream = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
   AvroKeyInputFormat[GenericRecord]](input)

 You can also specify a filter function
 http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext
 as the second argument.

 Thanks
 Best Regards

 On Wed, Aug 19, 2015 at 10:46 PM, Masf masfwo...@gmail.com wrote:

 Hi.

 I'd like to read Avro files using this library
 https://github.com/databricks/spark-avro

 I need to load several files from a folder, not all files. Is there some
 functionality to filter the files to load?

 And... Is it possible to know the names of the files loaded from a folder?

 My problem is that I have a folder where an external process is inserting
 files every X minutes and I need to process these files once, and I can't
 move, rename or copy the source files.


 Thanks
 --

 Regards
 Miguel Ángel





-- 


Saludos.
Miguel Ángel


Spark 1.3. Insert into hive parquet partitioned table from DataFrame

2015-08-20 Thread Masf
Hi.

I have a DataFrame and I want to insert this data into a partitioned Parquet
table in Hive.

In Spark 1.4 I can use
df.write.partitionBy("x", "y").format("parquet").mode("append").saveAsTable("tbl_parquet")

but in Spark 1.3 I can't. How can I do it?
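
One possible workaround in 1.3, assuming a HiveContext and that the partitioned table already exists, is to register the DataFrame as a temporary table and insert with dynamic partitioning:

// Sketch only: assumes hiveContext and an existing table tbl_parquet partitioned by (x, y)
df.registerTempTable("df_tmp")
hiveContext.sql("SET hive.exec.dynamic.partition=true")
hiveContext.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
// the partition columns x and y must be the last columns of the SELECT
hiveContext.sql("INSERT INTO TABLE tbl_parquet PARTITION (x, y) SELECT * FROM df_tmp")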

Thanks

-- 
Regards
Miguel


SQLContext load. Filtering files

2015-08-19 Thread Masf
Hi.

I'd like to read Avro files using this library
https://github.com/databricks/spark-avro

I need to load several files from a folder, not all files. Is there some
functionality to filter the files to load?

And... Is it possible to know the names of the files loaded from a folder?

My problem is that I have a folder where an external process is inserting
files every X minutes and I need to process these files once, and I can't
move, rename or copy the source files.
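
One approach, assuming the files to process can be matched with a glob pattern and that Spark 1.3's load(path, source) API is available, is to list the matching files first with the Hadoop FileSystem API (so their names are known) and then pass the same pattern to load:

import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical pattern: only the files written for a given hour
val pattern = "hdfs:///landing/events/part-2015-08-19-10*.avro"

// List the matching file names first, so they can be recorded as processed
val fs = FileSystem.get(sc.hadoopConfiguration)
val matched = fs.globStatus(new Path(pattern)).map(_.getPath.toString)
matched.foreach(println)

// spark-avro accepts glob patterns in the path
val df = sqlContext.load(pattern, "com.databricks.spark.avro")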


Thanks
-- 

Regards
Miguel Ángel


Dataframe Partitioning

2015-05-28 Thread Masf
Hi.

I have 2 DataFrames with 1 and 12 partitions respectively. When I do an inner
join between these DataFrames, the result contains 200 partitions. *Why?*

df1.join(df2, df1("id") === df2("id"), "inner")  // returns 200 partitions
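
For reference, 200 is the default value of spark.sql.shuffle.partitions, which controls the number of partitions produced by shuffles such as joins; a minimal way to change it, assuming a SQLContext named sqlContext:

// Joins and aggregations shuffle into spark.sql.shuffle.partitions partitions (default 200)
sqlContext.setConf("spark.sql.shuffle.partitions", "12")

val joined = df1.join(df2, df1("id") === df2("id"), "inner")
println(joined.rdd.partitions.size)  // 12 instead of 200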


Thanks!!!
-- 


Regards.
Miguel Ángel


Re: Adding columns to DataFrame

2015-05-27 Thread Masf
Hi.

I think that it's possible to do:

df.select($"*", lit(null).as("col17"), lit(null).as("col18"),
  lit(null).as("col19"), ..., lit(null).as("col26"))
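
A variant with withColumn in a fold, where the extra column names and their types are assumptions, could look like this:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

// Hypothetical: the 10 column names present in df2 but missing from df1
val missingCols = (17 to 26).map(i => s"col$i")

// Add each missing column as a typed null so the unionAll schemas line up
val df1Padded = missingCols.foldLeft(df1) { (df, name) =>
  df.withColumn(name, lit(null).cast(StringType))
}

val unioned = df1Padded.unionAll(df2)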

Any other advice?

Miguel.

On Wed, May 27, 2015 at 5:02 PM, Masf masfwo...@gmail.com wrote:

 Hi.

 I have a DataFrame with 16 columns (df1) and another with 26 columns(df2).
 I want to do a UnionAll.  So, I want to add 10 columns to df1 in order to
 have the same number of columns in both dataframes.

 Is there some alternative to withColumn?

 Thanks

 --
 Regards.
 Miguel Ángel




-- 


Saludos.
Miguel Ángel


Adding columns to DataFrame

2015-05-27 Thread Masf
Hi.

I have a DataFrame with 16 columns (df1) and another with 26 columns (df2).
I want to do a unionAll, so I want to add 10 columns to df1 in order to
have the same number of columns in both DataFrames.

Is there some alternative to withColumn?

Thanks

-- 
Regards.
Miguel Ángel


Re: DataFrame. Conditional aggregation

2015-05-27 Thread Masf
Yes, I think that your SQL solution will work, but I was looking for a
solution with the DataFrame API. I had thought of using a UDF such as:

val myFunc = udf { (input: Int) => if (input > 100) 1 else 0 }

Although I'd like to know if it's possible to do it directly in the
aggregation, inserting a lambda function or something else.
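
For reference, a conditional sum can also be written directly in agg with when/otherwise from org.apache.spark.sql.functions, assuming a Spark version that has them (1.4+); a sketch:

import org.apache.spark.sql.functions.{col, count, lit, min, sum, when}

val result = joinedData.groupBy("col1", "col2").agg(
  count(lit(1)).as("counter"),
  min(col("col3")).as("minimum"),
  // conditional aggregation without a UDF
  sum(when(col("endrscp") > 100, 1).otherwise(0)).as("test")
)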

Thanks

Regards.
Miguel.


On Wed, May 27, 2015 at 1:06 AM, ayan guha guha.a...@gmail.com wrote:

 For this, I can give you a SQL solution:

 joinedData.registerTempTable('j')

 Res = ssc.sql('select col1, col2, count(1) counter, min(col3) minimum,
 sum(case when endrscp > 100 then 1 else 0 end) test from j')

 Let me know if this works.
 On 26 May 2015 23:47, Masf masfwo...@gmail.com wrote:

 Hi
 I don't know how it works. For example:

 val result = joinedData.groupBy("col1", "col2").agg(
   count(lit(1)).as("counter"),
   min("col3").as("minimum"),
   sum(case when endrscp > 100 then 1 else 0 end).as("test")
 )

 How can I do it?

 Thanks
 Regards.
 Miguel.

 On Tue, May 26, 2015 at 12:35 AM, ayan guha guha.a...@gmail.com wrote:

 Case when col2 > 100 then 1 else col2 end
 On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote:

 Hi.

 In a DataFrame, how can I execute a conditional expression in an
 aggregation? For example, can I translate this SQL statement to the DataFrame API?:

 SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1)
 FROM table
 GROUP BY name

 Thanks
 --
 Regards.
 Miguel




 --


 Saludos.
 Miguel Ángel




-- 


Saludos.
Miguel Ángel


Re: DataFrame. Conditional aggregation

2015-05-26 Thread Masf
Hi
I don't know how it works. For example:

val result = joinedData.groupBy("col1", "col2").agg(
  count(lit(1)).as("counter"),
  min("col3").as("minimum"),
  sum(case when endrscp > 100 then 1 else 0 end).as("test")
)

How can I do it?

Thanks
Regards.
Miguel.

On Tue, May 26, 2015 at 12:35 AM, ayan guha guha.a...@gmail.com wrote:

 Case when col2 > 100 then 1 else col2 end
 On 26 May 2015 00:25, Masf masfwo...@gmail.com wrote:

 Hi.

 In a DataFrame, how can I execute a conditional expression in an
 aggregation? For example, can I translate this SQL statement to the DataFrame API?:

 SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1)
 FROM table
 GROUP BY name

 Thanks
 --
 Regards.
 Miguel




-- 


Saludos.
Miguel Ángel


DataFrame. Conditional aggregation

2015-05-25 Thread Masf
Hi.

In a DataFrame, how can I execute a conditional expression in an
aggregation? For example, can I translate this SQL statement to the DataFrame
API?:

SELECT name, SUM(IF table.col2 > 100 THEN 1 ELSE table.col1)
FROM table
GROUP BY name

Thanks
-- 
Regards.
Miguel


Re: Parquet number of partitions

2015-05-05 Thread Masf
Hi Eric.

Q1:
When I read Parquet files, I've observed that Spark generates as many
partitions as there are Parquet files in the path.

Q2:
To reduce the number of partitions you can use rdd.repartition(x), where x is
the number of partitions. Depending on your case, repartition could be a heavy task.
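
A minimal sketch of both options, assuming a DataFrame read from Parquet:

val df = sqlContext.parquetFile("hdfs:///warehouse/file.parquet")

// repartition shuffles all the data and can be expensive
val withShuffle = df.rdd.repartition(50)

// coalesce avoids a full shuffle when only reducing the partition count
val withoutShuffle = df.rdd.coalesce(50)

println(withoutShuffle.partitions.size)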


Regards.
Miguel.

On Tue, May 5, 2015 at 3:56 PM, Eric Eijkelenboom 
eric.eijkelenb...@gmail.com wrote:

 Hello guys

 Q1: How does Spark determine the number of partitions when reading a
 Parquet file?

 val df = sqlContext.parquetFile(path)

 Is it some way related to the number of Parquet row groups in my input?

 Q2: How can I reduce this number of partitions? Doing this:

 df.rdd.coalesce(200).count

 from the spark-shell causes job execution to hang…

 Any ideas? Thank you in advance.

 Eric
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




-- 


Saludos.
Miguel Ángel


Inserting Nulls

2015-05-05 Thread Masf
Hi.

I have a Spark application where I store the results into a table (with
HiveContext). Some of these columns allow nulls. In Scala, these columns are
represented as Option[Int] or Option[Double], depending on the data type.

For example:

val hc = new HiveContext(sc)
var col1: Option[Integer] = None
...

val myRow = org.apache.spark.sql.Row(col1, ...)

val mySchema = StructType(Array(StructField("Column1", IntegerType, true)))

val TableOutputSchemaRDD = hc.applySchema(myRow, mySchema)
hc.registerRDDAsTable(TableOutputSchemaRDD, "result_intermediate")
hc.sql("CREATE TABLE table_output STORED AS PARQUET AS SELECT * FROM result_intermediate")

Produce:

java.lang.ClassCastException: scala.Some cannot be cast to java.lang.Integer
at
org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaIntObjectInspector.get(JavaIntObjectInspector.java:40)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:247)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
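
The cast error suggests the Row holds the scala Option itself (Some/None) rather than the underlying value; a minimal sketch of unwrapping it before building the Row, assuming an integer column:

// Rows should hold the unwrapped value or null, not Some(...)/None
val col1: Option[Int] = None
val boxed: java.lang.Integer = col1.map(Int.box).orNull
val myRow = org.apache.spark.sql.Row(boxed)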




Thanks!
--

Regards.
Miguel Ángel


Re: Opening many Parquet files = slow

2015-04-15 Thread Masf
Hi guys

Regarding Parquet files: I have Spark 1.2.0, and reading 27 Parquet files
(250 MB/file) takes 4 minutes.

I have a cluster with 4 nodes and that seems too slow to me.

The load function is not available in Spark 1.2, so I can't test it.


Regards.
Miguel.

On Mon, Apr 13, 2015 at 8:12 PM, Eric Eijkelenboom 
eric.eijkelenb...@gmail.com wrote:

 Hi guys

 Does anyone know how to stop Spark from opening all Parquet files before
 starting a job? This is quite a show stopper for me, since I have 5000
 Parquet files on S3.

 Recap of what I tried:

 1. Disable schema merging with: sqlContext.load("parquet",
 Map("mergeSchema" -> "false", "path" -> "s3://path/to/folder"))
 This opens most files in the folder (17 out of 21 in my small
 example). For 5000 files on S3, sqlContext.load() takes 30 minutes to
 complete.

 2. Use the old api with:
 sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
 Now sqlContext.parquetFile() only opens a few files and prints the
 schema: so far so good! However, as soon as I run e.g. a count() on the
 dataframe, Spark still opens all files _before_ starting a job/stage.
 Effectively this moves the delay from load() to count() (or any other
 action I presume).

 3. Run Spark 1.3.1-rc2.
 sqlContext.load() took about 30 minutes for 5000 Parquet files on S3,
 the same as 1.3.0.

 Any help would be greatly appreciated!

 Thanks a lot.

 Eric




 On 10 Apr 2015, at 16:46, Eric Eijkelenboom eric.eijkelenb...@gmail.com
 wrote:

 Hi Ted

 Ah, I guess the term ‘source’ confused me :)

 Doing:

 sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path
 to a single day of logs"))

 for 1 directory with 21 files, Spark opens 17 files:

 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
 s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-72'
 for reading
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-72' for
 reading at position '261573524'
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
 s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-74'
 for reading
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
 s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-77'
 for reading
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening '
 s3n://mylogs/logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-62'
 for reading
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-74' for
 reading at position '259256807'
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-77' for
 reading at position '260002042'
 15/04/10 14:31:42 INFO s3native.NativeS3FileSystem: Opening key
 'logs/=2015/mm=2/dd=1/bab8c575-29e7-4456-a1a1-23f8f746e46a-62' for
 reading at position ‘260875275'
 etc.

 I can’t seem to pass a comma-separated list of directories to load(), so
 in order to load multiple days of logs, I have to point to the root folder
 and depend on auto-partition discovery (unless there’s a smarter way).

 Doing:

 sqlContext.load("parquet", Map("mergeSchema" -> "false", "path" -> "path
 to root log dir"))

 starts opening what seems like all files (I killed the process after a
 couple of minutes).

 Thanks for helping out.
 Eric





-- 


Saludos.
Miguel Ángel


Re: Increase partitions reading Parquet File

2015-04-14 Thread Masf
Hi.

It doesn't work.

val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")
file.repartition(127)

println(file.partitions.size.toString()) -- Returns 27!
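
One likely explanation, assuming file is the SchemaRDD returned by parquetFile: repartition does not modify file in place, it returns a new RDD, so the result has to be used:

val file = sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")

// repartition returns a new RDD; file keeps its original partitioning
val repartitioned = file.repartition(127)

println(file.partitions.size)          // still 27
println(repartitioned.partitions.size) // 127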

Regards


On Fri, Apr 10, 2015 at 4:50 PM, Felix C felixcheun...@hotmail.com wrote:

  RDD.repartition(1000)?

 --- Original Message ---

 From: Masf masfwo...@gmail.com
 Sent: April 9, 2015 11:45 PM
 To: user@spark.apache.org
 Subject: Increase partitions reading Parquet File

  Hi

  I have this statement:

  val file =
 sqlContext.parquetFile("hdfs://node1/user/hive/warehouse/file.parquet")

  This code generates as many partitions as there are files. So, I want to
 increase the number of partitions.
 I've tested coalesce (file.coalesce(100)) but the number of partitions
 doesn't change.

  How can I increase the number of partitions?

  Thanks

  --


 Regards.
 Miguel Ángel




-- 


Saludos.
Miguel Ángel


Re: Error reading smallin in hive table with parquet format

2015-04-02 Thread Masf
No, my company is using a Cloudera distribution, and 1.2.0 is the latest
version of Spark available there.

Thanks

On Wed, Apr 1, 2015 at 8:08 PM, Michael Armbrust mich...@databricks.com
wrote:

 Can you try with Spark 1.3?  Much of this code path has been rewritten /
 improved in this version.

 On Wed, Apr 1, 2015 at 7:53 AM, Masf masfwo...@gmail.com wrote:


 Hi.

 In Spark SQL 1.2.0, with HiveContext, I'm executing the following
 statement:

 CREATE TABLE testTable STORED AS PARQUET AS
  SELECT
 field1
  FROM table1

 *field1 is SMALLINT. If table1 is in text format all it's ok, but if
 table1 is in parquet format, spark returns the following error*:



 15/04/01 16:48:24 ERROR TaskSetManager: Task 26 in stage 1.0 failed 1
 times; aborting job
 Exception in thread main org.apache.spark.SparkException: Job aborted
 due to stage failure: Task 26 in stage 1.0 failed 1 times, most recent
 failure: Lost task 26.0 in stage 1.0 (TID 28, localhost):
 java.lang.ClassCastException: java.lang.Integer cannot be cast to
 java.lang.Short
 at
 org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:251)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
 at
 org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
 $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)

 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



 Thanks!
 --


 Regards.
 Miguel Ángel





-- 


Saludos.
Miguel Ángel


Spark SQL. Memory consumption

2015-04-02 Thread Masf
Hi.

I'm using Spark SQL 1.2. I have this query:

CREATE TABLE test_MA STORED AS PARQUET AS
 SELECT
field1
,field2
,field3
,field4
,field5
,COUNT(1) AS field6
,MAX(field7)
,MIN(field8)
,SUM(field9 / 100)
,COUNT(field10)
,SUM(IF(field11 > -500, 1, 0))
,MAX(field12)
,SUM(IF(field13 = 1, 1, 0))
,SUM(IF(field13 in (3,4,5,6,10,104,105,107), 1, 0))
,SUM(IF(field13 = 2012 , 1, 0))
,SUM(IF(field13 in (0,100,101,102,103,106), 1, 0))
 FROM table1 CL
JOIN table2 netw
ON CL.field15 = netw.id
WHERE
AND field3 IS NOT NULL
AND field4 IS NOT NULL
AND field5 IS NOT NULL
GROUP BY field1,field2,field3,field4, netw.field5


spark-submit --master spark://master:7077 *--driver-memory 20g
--executor-memory 60g* --class GMain project_2.10-1.0.jar
--driver-class-path '/opt/cloudera/parcels/CDH/lib/hive/lib/*'
--driver-java-options
'-Dspark.executor.extraClassPath=/opt/cloudera/parcels/CDH/lib/hive/lib/*'
2> ./error


The input data is 8 GB in Parquet format. It crashes many times with *GC
overhead* errors. I've set spark.shuffle.partitions to 1024, but my worker
nodes (with 128 GB RAM/node) collapse.

*Is this query too difficult for Spark SQL?*
*Would it be better to do it with plain Spark (the RDD API)?*
*Am I doing something wrong?*


Thanks
-- 


Regards.
Miguel Ángel


Error reading smallin in hive table with parquet format

2015-04-01 Thread Masf
Hi.

In Spark SQL 1.2.0, with HiveContext, I'm executing the following statement:

CREATE TABLE testTable STORED AS PARQUET AS
 SELECT
field1
 FROM table1

*field1 is SMALLINT. If table1 is in text format everything is OK, but if table1
is in parquet format, Spark returns the following error*:



15/04/01 16:48:24 ERROR TaskSetManager: Task 26 in stage 1.0 failed 1
times; aborting job
Exception in thread main org.apache.spark.SparkException: Job aborted due
to stage failure: Task 26 in stage 1.0 failed 1 times, most recent failure:
Lost task 26.0 in stage 1.0 (TID 28, localhost):
java.lang.ClassCastException: java.lang.Integer cannot be cast to
java.lang.Short
at
org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaShortObjectInspector.get(JavaShortObjectInspector.java:41)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createPrimitive(ParquetHiveSerDe.java:251)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createObject(ParquetHiveSerDe.java:301)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.createStruct(ParquetHiveSerDe.java:178)
at
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe.serialize(ParquetHiveSerDe.java:164)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:123)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1$1.apply(InsertIntoHiveTable.scala:114)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:114)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:93)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



Thanks!
-- 


Regards.
Miguel Ángel


Re: Error in Delete Table

2015-03-31 Thread Masf
Hi Ted.

Spark 1.2.0 and Hive 0.13.1

Regards.
Miguel Angel.


On Tue, Mar 31, 2015 at 10:37 AM, Ted Yu yuzhih...@gmail.com wrote:

 Which Spark and Hive release are you using ?

 Thanks



  On Mar 27, 2015, at 2:45 AM, Masf masfwo...@gmail.com wrote:
 
  Hi.
 
  In HiveContext, when I put this statement DROP TABLE IF EXISTS
 TestTable
  If TestTable doesn't exist, spark returns an error:
 
 
 
  ERROR Hive: NoSuchObjectException(message:default.TestTable table not
 found)
at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
at
 org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
at
 org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
at
 org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1008)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
 org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
at com.sun.proxy.$Proxy22.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1000)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:942)
at
 org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3887)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:310)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at
 org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
at
 org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
at
 org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
at
 org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
at
 org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at
 org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at
 org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at GeoMain$.HiveExecution(GeoMain.scala:96)
at GeoMain$.main(GeoMain.scala:17)
at GeoMain.main(GeoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 
 
  Thanks!!
  --
 
 
  Regards.
  Miguel Ángel




-- 


Saludos.
Miguel Ángel


Re: Too many open files

2015-03-30 Thread Masf
I'm executing my application in local mode (with --master local[*]).

I'm using Ubuntu and I've added 'session required pam_limits.so' to
/etc/pam.d/common-session, but it doesn't work.

On Mon, Mar 30, 2015 at 4:08 PM, Ted Yu yuzhih...@gmail.com wrote:

 bq. In /etc/secucity/limits.conf set the next values:

 Have you done the above modification on all the machines in your Spark
 cluster ?

 If you use Ubuntu, be sure that the /etc/pam.d/common-session file
 contains the following line:

 session required  pam_limits.so


 On Mon, Mar 30, 2015 at 5:08 AM, Masf masfwo...@gmail.com wrote:

 Hi.

 I've done relogin, in fact, I put 'uname -n' and returns 100, but it
 crashs.
 I'm doing reduceByKey and SparkSQL mixed over 17 files (250MB-500MB/file)


 Regards.
 Miguel Angel.

 On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Mostly, you will have to restart the machines to get the ulimit effect
 (or relogin). What operation are you doing? Are you doing too many
 repartitions?

 Thanks
 Best Regards

 On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote:

 Hi

 I have a problem with temp data in Spark. I have fixed
 spark.shuffle.manager to SORT. In /etc/secucity/limits.conf set the next
 values:
 *   softnofile  100
 *   hardnofile  100
 In spark-env.sh set ulimit -n 100
 I've restarted the spark service and it continues crashing (Too many
 open files)

 How can I resolve? I'm executing Spark 1.2.0 in Cloudera 5.3.2

 java.io.FileNotFoundException:
 /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
 (Too many open files)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
 at scala.Array$.fill(Array.scala:267)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID
 27, localhost): java.io.FileNotFoundException:
 /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
 (Too many open files)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
 at scala.Array$.fill(Array.scala:267)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)



 Thanks!
 --


 Regards.
 Miguel Ángel





 --


 Saludos.
 Miguel Ángel





-- 


Saludos.
Miguel Ángel


Too many open files

2015-03-30 Thread Masf
Hi

I have a problem with temp data in Spark. I have set spark.shuffle.manager to
SORT. In /etc/security/limits.conf I set the following values:
*   soft    nofile  100
*   hard    nofile  100
In spark-env.sh I set ulimit -n 100
I've restarted the Spark service and it keeps crashing (Too many open
files).

How can I resolve this? I'm executing Spark 1.2.0 on Cloudera 5.3.2.

java.io.FileNotFoundException:
/tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
(Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID 27,
localhost): java.io.FileNotFoundException:
/tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
(Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
at scala.Array$.fill(Array.scala:267)
at
org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)



Thanks!
-- 


Regards.
Miguel Ángel


Re: Too many open files

2015-03-30 Thread Masf
Hi.

I've done a relogin; in fact, I ran 'ulimit -n' and it returns 100, but it
still crashes.
I'm doing reduceByKey and Spark SQL mixed over 17 files (250MB-500MB/file).


Regards.
Miguel Angel.

On Mon, Mar 30, 2015 at 1:52 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Mostly, you will have to restart the machines to get the ulimit effect (or
 relogin). What operation are you doing? Are you doing too many
 repartitions?

 Thanks
 Best Regards

 On Mon, Mar 30, 2015 at 4:52 PM, Masf masfwo...@gmail.com wrote:

 Hi

 I have a problem with temp data in Spark. I have fixed
 spark.shuffle.manager to SORT. In /etc/secucity/limits.conf set the next
 values:
 *   softnofile  100
 *   hardnofile  100
 In spark-env.sh set ulimit -n 100
 I've restarted the spark service and it continues crashing (Too many open
 files)

 How can I resolve? I'm executing Spark 1.2.0 in Cloudera 5.3.2

 java.io.FileNotFoundException:
 /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
 (Too many open files)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
 at scala.Array$.fill(Array.scala:267)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 15/03/30 11:54:18 WARN TaskSetManager: Lost task 22.0 in stage 3.0 (TID
 27, localhost): java.io.FileNotFoundException:
 /tmp/spark-local-20150330115312-37a7/2f/temp_shuffle_c4ba5bce-c516-4a2a-9e40-56121eb84a8c
 (Too many open files)
 at java.io.FileOutputStream.open(Native Method)
 at java.io.FileOutputStream.init(FileOutputStream.java:221)
 at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:123)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:360)
 at
 org.apache.spark.util.collection.ExternalSorter$$anonfun$spillToPartitionFiles$1.apply(ExternalSorter.scala:355)
 at scala.Array$.fill(Array.scala:267)
 at
 org.apache.spark.util.collection.ExternalSorter.spillToPartitionFiles(ExternalSorter.scala:355)
 at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:211)
 at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
 at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:56)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)



 Thanks!
 --


 Regards.
 Miguel Ángel





-- 


Saludos.
Miguel Ángel


Error in Delete Table

2015-03-27 Thread Masf
Hi.

In HiveContext, when I execute the statement "DROP TABLE IF EXISTS TestTable"
and TestTable doesn't exist, Spark returns an error:



ERROR Hive: NoSuchObjectException(message:default.TestTable table not found)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036)
at
org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1008)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:90)
at com.sun.proxy.$Proxy22.getTable(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1000)
at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:942)
at
org.apache.hadoop.hive.ql.exec.DDLTask.dropTableOrPartitions(DDLTask.java:3887)
at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:310)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1554)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1321)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1139)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:962)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:952)
at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:305)
at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:276)
at
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult$lzycompute(commands.scala:58)
at
org.apache.spark.sql.hive.execution.DropTable.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.Command$class.execute(commands.scala:46)
at org.apache.spark.sql.hive.execution.DropTable.execute(commands.scala:51)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
at
org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
at org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
at org.apache.spark.sql.SchemaRDD.init(SchemaRDD.scala:108)
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:94)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at GeoMain$$anonfun$HiveExecution$1.apply(GeoMain.scala:98)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at GeoMain$.HiveExecution(GeoMain.scala:96)
at GeoMain$.main(GeoMain.scala:17)
at GeoMain.main(GeoMain.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks!!
-- 


Regards.
Miguel Ángel


Re: Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
OK, thanks.

Is there some web resource where I could check the functionality supported
by Spark SQL?

Thanks!!!

Regards.
Miguel Ángel.

On Thu, Mar 26, 2015 at 12:31 PM, Cheng Lian lian.cs@gmail.com wrote:

  We're working together with AsiaInfo on this. Possibly will deliver an
 initial version of window function support in 1.4.0. But it's not a promise
 yet.

 Cheng

 On 3/26/15 7:27 PM, Arush Kharbanda wrote:

 Its not yet implemented.

  https://issues.apache.org/jira/browse/SPARK-1442

 On Thu, Mar 26, 2015 at 4:39 PM, Masf masfwo...@gmail.com wrote:

 Hi.

  Are the Windowing and Analytics functions supported in Spark SQL (with
 HiveContext or not)? For example in Hive is supported
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics


  Some tutorial or documentation where I can see all features supported
 by Spark SQL?


  Thanks!!!
 --


 Regards.
 Miguel Ángel




  --

 [image: Sigmoid Analytics] http://htmlsig.com/www.sigmoidanalytics.com

 *Arush Kharbanda* || Technical Teamlead

 ar...@sigmoidanalytics.com || www.sigmoidanalytics.com





-- 


Saludos.
Miguel Ángel


Windowing and Analytics Functions in Spark SQL

2015-03-26 Thread Masf
Hi.

Are the Windowing and Analytics functions supported in Spark SQL (with
HiveContext or not)? For example, in Hive they are supported:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics


Is there some tutorial or documentation where I can see all the features
supported by Spark SQL?


Thanks!!!
-- 


Regards.
Miguel Ángel


Re: Issues with SBT and Spark

2015-03-19 Thread Masf
Hi

Spark 1.2.1 is built against Scala 2.10 by default. Because of this, your
program compiled with Scala 2.11 fails at runtime.
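
A build.sbt consistent with that, assuming Scala 2.10.4 and letting %% pick the matching artifact suffix, would look like:

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"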

Regards

On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan kvi...@vt.edu wrote:

 My current simple.sbt is

 name := "SparkEpiFast"

 version := "1.0"

 scalaVersion := "2.11.4"

 libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" %
 "provided"

 libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1"
 % "provided"

 While I do sbt package, it compiles successfully. But while running the
 application, I get
 Exception in thread main java.lang.NoSuchMethodError:
 scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;

 However, changing the scala version to 2.10.4 and updating the dependency
 lines appropriately resolves the issue (no exception).

 Could anyone please point out what I am doing wrong?




-- 


Saludos.
Miguel Ángel


Hive error on partitioned tables

2015-03-17 Thread Masf
Hi.

I'm running Spark 1.2.0. I have HiveContext and I execute the following
query:

select sum(field1 / 100) from table1 group by field2;

field1 in the Hive metastore is a smallint. The schema detected by HiveContext
is an int32:
fileSchema: message schema {

  optional int32 field1;
  
}

If table1 is an unpartitioned table it works well; however, if table1 is a
partitioned table it crashes in spark-submit. The error is the following:

java.lang.ClassCastException: java.lang.Integer cannot be cast to
java.lang.Short
at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
at scala.math.Numeric$ShortIsIntegral$.toInt(Numeric.scala:72)
at
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:366)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at
org.apache.spark.sql.catalyst.expressions.Expression.f1(Expression.scala:162)
at
org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:115)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at
org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:109)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:90)
at
org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
at
org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:72)
at
org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:526)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/17 10:42:51 ERROR Executor: Exception in task 1.0 in stage 3.0 (TID 5)
java.lang.ClassCastException: java.lang.Integer cannot be cast to
java.lang.Short
at scala.runtime.BoxesRunTime.unboxToShort(BoxesRunTime.java:102)
at scala.math.Numeric$ShortIsIntegral$.toInt(Numeric.scala:72)
at
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at
org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$6.apply(Cast.scala:234)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:366)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at
org.apache.spark.sql.catalyst.expressions.Expression.f1(Expression.scala:162)
at
org.apache.spark.sql.catalyst.expressions.Divide.eval(arithmetic.scala:115)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:365)
at
org.apache.spark.sql.catalyst.expressions.Expression.n2(Expression.scala:109)
at org.apache.spark.sql.catalyst.expressions.Add.eval(arithmetic.scala:90)
at
org.apache.spark.sql.catalyst.expressions.Coalesce.eval(nullFunctions.scala:50)
at
org.apache.spark.sql.catalyst.expressions.MutableLiteral.update(literals.scala:72)
at
org.apache.spark.sql.catalyst.expressions.SumFunction.update(aggregates.scala:526)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
at
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:601)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at

Re: Spark SQL. Cast to Bigint

2015-03-17 Thread Masf
Hi Yin

With HiveContext it works well.

Thanks!!!

Regards.
Miguel Angel.



On Fri, Mar 13, 2015 at 3:18 PM, Yin Huai yh...@databricks.com wrote:

 Are you using SQLContext? Right now, the parser in the SQLContext is quite
 limited on the data type keywords that it handles (see here
 https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala#L391)
 and unfortunately bigint is not handled in it right now. We will add
 other data types in there (
 https://issues.apache.org/jira/browse/SPARK-6146 is used to track it).
 Can you try HiveContext for now?

 On Fri, Mar 13, 2015 at 4:48 AM, Masf masfwo...@gmail.com wrote:

 Hi.

 I have a query in Spark SQL and I cannot convert a value to BIGINT:
 CAST(column AS BIGINT) or
 CAST(0 AS BIGINT)

 The output is:
 Exception in thread main java.lang.RuntimeException: [34.62] failure:
 ``DECIMAL'' expected but identifier BIGINT found

 Thanks!!
 Regards.
 Miguel Ángel





-- 


Regards.
Miguel Ángel


Parquet and repartition

2015-03-16 Thread Masf
Hi all.

When I specify the number of partitions and save this RDD in Parquet
format, my app fails. For example:

selectTest.coalesce(28).saveAsParquetFile("hdfs://vm-clusterOutput")

However, it works well if I store the data as text:

selectTest.coalesce(28).saveAsTextFile("hdfs://vm-clusterOutput")


My Spark version is 1.2.1.

Is this bug registered?


-- 


Regards.
Miguel Ángel


Re: Parquet and repartition

2015-03-16 Thread Masf
Thanks Sean, I forgot to include it.

The output error is the following:

java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:115)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/16 11:30:11 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 207)
java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:115)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/03/16 11:30:11 INFO TaskSetManager: Starting task 2.0 in stage 6.0 (TID 208, localhost, ANY, 2878 bytes)
15/03/16 11:30:11 WARN TaskSetManager: Lost task 0.0 in stage 6.0 (TID 206, localhost): java.lang.ClassCastException: scala.math.BigDecimal cannot be cast to org.apache.spark.sql.catalyst.types.decimal.Decimal
at org.apache.spark.sql.parquet.MutableRowWriteSupport.consumeType(ParquetTableSupport.scala:359)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:328)
at org.apache.spark.sql.parquet.MutableRowWriteSupport.write(ParquetTableSupport.scala:314)
at parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:115)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:308)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:325)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
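
In case it helps, this is the kind of workaround I can try on my side. It is an
untested sketch: it assumes the DecimalType column is populated from plain
scala.math.BigDecimal values via applySchema, and it wraps those values in
Catalyst's Decimal (the type named in the ClassCastException above) before the
Parquet write. The schema, names and output path below are placeholders.

import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.types.decimal.Decimal

// Untested sketch; rawRdd stands in for an RDD[(Int, scala.math.BigDecimal)].
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("amount", DecimalType(18, 2), nullable = true)))

// Wrap scala.math.BigDecimal in Catalyst's Decimal (precision/scale match the
// placeholder schema) so MutableRowWriteSupport receives the type it expects.
val rows = rawRdd.map { case (id, amount) => Row(id, Decimal(amount, 18, 2)) }

val schemaRdd = sqlContext.applySchema(rows, schema)
schemaRdd.coalesce(28).saveAsParquetFile("hdfs://vm-cluster/outputParquet")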



On Mon, Mar 16, 2015 at 12:19 PM, Sean Owen so...@cloudera.com wrote:

 You forgot to give any information about what "fail" means here.

 On Mon, Mar 16, 2015 at 11:11 AM, Masf masfwo...@gmail.com wrote:
  Hi all.
 
  When I specify the number of partitions and save this RDD in Parquet
  format, my app fails. For example:

Spark SQL. Cast to Bigint

2015-03-13 Thread Masf
Hi.

I have a query in Spark SQL and I cannot convert a value to BIGINT:
CAST(column AS BIGINT) or
CAST(0 AS BIGINT)

The output is:
Exception in thread main java.lang.RuntimeException: [34.62] failure:
``DECIMAL'' expected but identifier BIGINT found

Thanks!!
Regards.
Miguel Ángel


Re: Read parquet folders recursively

2015-03-12 Thread Masf
Hi.

Thanks for your answers, but is it true that, in order to read Parquet files,
it is necessary to use the parquetFile method in org.apache.spark.sql.SQLContext?

How can I combine your solution with the call to this method?
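
Would something like this be a reasonable way to combine them? It is an
untested sketch: it assumes SparkHadoopUtil is usable from my code, that all
the leaf files share the same schema, and that the base path is a placeholder.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.deploy.SparkHadoopUtil

// Untested sketch: list every leaf file under the base directory, keep only
// the Parquet part files, read each one with parquetFile and union the results.
val fs = FileSystem.get(sc.hadoopConfiguration)
val leaves = SparkHadoopUtil.get.listLeafStatuses(fs, new Path("hdfs://vm-cluster/data"))
val parquetPaths = leaves.map(_.getPath.toString).filter(_.endsWith(".parquet"))

// reduce assumes at least one Parquet file was found under the base path
val combined = parquetPaths.map(p => sqlContext.parquetFile(p)).reduce((a, b) => a.unionAll(b))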

Thanks!!
Regards

On Thu, Mar 12, 2015 at 8:34 AM, Yijie Shen henry.yijies...@gmail.com
wrote:

 org.apache.spark.deploy.SparkHadoopUtil has a method:

 /**
  * Get [[FileStatus]] objects for all leaf children (files) under the given
  * base path. If the given path points to a file, return a single-element
  * collection containing [[FileStatus]] of that file.
  */
 def listLeafStatuses(fs: FileSystem, basePath: Path): Seq[FileStatus] = {
   def recurse(path: Path) = {
     val (directories, leaves) = fs.listStatus(path).partition(_.isDir)
     leaves ++ directories.flatMap(f => listLeafStatuses(fs, f.getPath))
   }

   val baseStatus = fs.getFileStatus(basePath)
   if (baseStatus.isDir) recurse(basePath) else Array(baseStatus)
 }

 —
 Best Regards!
 Yijie Shen

 On March 12, 2015 at 2:35:49 PM, Akhil Das (ak...@sigmoidanalytics.com)
 wrote:

  Hi

 We have a custom build to read directories recursively. Currently we use
 it with fileStream like this:

  val lines = ssc.fileStream[LongWritable, Text, TextInputFormat]("/datadumps/",
    (t: Path) => true, true, *true*)


 Making the 4th argument true to read recursively.


 You could give it a try
 https://s3.amazonaws.com/sigmoidanalytics-builds/spark-1.2.0-bin-spark-1.2.0-hadoop2.4.0.tgz

  Thanks
 Best Regards

 On Wed, Mar 11, 2015 at 9:45 PM, Masf masfwo...@gmail.com wrote:

 Hi all

 Is it possible to read folders recursively in order to load Parquet files?


 Thanks.

 --


 Regards.
 Miguel Ángel





-- 


Regards.
Miguel Ángel


Read parquet folders recursively

2015-03-11 Thread Masf
Hi all

Is it possible to read folders recursively in order to load Parquet files?


Thanks.

-- 


Regards.
Miguel Ángel