spark kudu issues

2018-06-20 Thread Pietro Gentile
Hi all,

I am currently evaluating using Spark with Kudu.
So I am facing the following issues:

1) If you try to DELETE a row with a key that is not present in the table,
you get an exception like this:

java.lang.RuntimeException: failed to write N rows from DataFrame to Kudu;
sample errors: Not found: key not found (error 0)

2) If you try to DELETE a row using only a subset of the table's key columns,
you get the following:

Caused by: java.lang.RuntimeException: failed to write N rows from
DataFrame to Kudu; sample errors: Invalid argument: No value provided for
key column:

The use cases above work correctly when interacting with Kudu through Impala.

Any suggestions to overcome these limitations?
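
A possible workaround (just a sketch, using the kudu-spark KuduContext API; the
master address, table name, source of keys, and key column names k1/k2 are
placeholders) is to restrict the delete set to full keys that actually exist in
the table before calling deleteRows:

import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kudu-delete-sketch").getOrCreate()
val kuduMaster = "kudu-master:7051"              // placeholder
val tableName  = "impala::default.my_table"      // placeholder
val kuduContext = new KuduContext(kuduMaster, spark.sparkContext)

// All current key values in the Kudu table ("k1", "k2" are assumed key columns).
val existingKeys = spark.read
  .format("org.apache.kudu.spark.kudu")
  .option("kudu.master", kuduMaster)
  .option("kudu.table", tableName)
  .load()
  .select("k1", "k2")

// Keys we want to delete; assumed here to provide only "k1" (a key prefix).
val toDelete = spark.table("staging_deletes").select("k1")   // placeholder source

// The semi-join expands the partial key to full existing keys, and keys that do
// not exist simply drop out, so deleteRows never sees a missing or partial key.
val safeDeletes = existingKeys.join(toDelete, Seq("k1"), "left_semi")

kuduContext.deleteRows(safeDeletes, tableName)

This trades an extra scan and join for the "delete if exists" behaviour that
Impala gives you.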

Thanks.
Best Regards

Pietro


Fwd: Spark CacheManager Thread-safety

2016-05-20 Thread Pietro Gentile
Hi all,

I have a series of doubts about CacheManager used by SQLContext to cache
DataFrame.

My use case requires different threads persisting and reading DataFrames
concurrently. Using Spark, I noticed that persistence does not really run
in parallel.

What I would like is that while I am persisting one DataFrame, another user
can persist a different DataFrame without waiting for the first to finish.

Is there any way to implement this scenario?
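
For reference, this is the kind of scenario I mean, as a minimal sketch (the toy
DataFrames stand in for two users' inputs; whether the two persists actually
overlap seems to depend on the Spark version and scheduler settings):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("concurrent-cache"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Two independent DataFrames (toy data standing in for two users' inputs).
val dfA = sc.parallelize(1 to 1000000).map(i => (i, s"a$i")).toDF("id", "v")
val dfB = sc.parallelize(1 to 1000000).map(i => (i, s"b$i")).toDF("id", "v")

// Each future persists its own DataFrame and forces materialization with count().
val fa = Future { dfA.persist(); dfA.count() }
val fb = Future { dfB.persist(); dfB.count() }

Await.result(Future.sequence(Seq(fa, fb)), Duration.Inf)

Ideally the second count would not have to wait for the first persist to finish,
but in practice the two operations appear to serialize.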

Thanks in advance,
Pietro.


Spark Web UI issue

2016-05-06 Thread Pietro Gentile
Hi all,


I have a long-running Spark application to which I submit jobs continuously.
These jobs use different SQLContext instances, so the application's web UI
fills up more and more with these instances.
Is there any way to prevent this? I don't want every created SQLContext to
show up in the web UI.
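
One partial mitigation may be to bound what the UI retains (a sketch; the values
below are arbitrary). These are existing Spark properties, but they limit history
rather than hide the created SQL contexts, so they may not fully solve this:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("long-running-app")
  .set("spark.ui.retainedJobs", "200")          // jobs kept in the UI
  .set("spark.ui.retainedStages", "200")        // stages kept in the UI
  .set("spark.sql.ui.retainedExecutions", "50") // SQL executions kept in the SQL tab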

thanks in advance,

PG


Spark Cache Eviction

2016-02-22 Thread Pietro Gentile
Hi all,

Is there a way to prevent an RDD from being evicted from the cache?
I would rather not rely on the cache's default LRU behavior; I would prefer
to manually unpersist RDDs cached in memory/disk.
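
What I have in mind is roughly the following sketch (sc is an existing
SparkContext and the input path is a placeholder): pick the storage level
explicitly so partitions spill to disk instead of being dropped, and release
the data only when I decide to:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs:///data/events")   // placeholder input path

// MEMORY_AND_DISK spills partitions to disk under memory pressure instead of
// dropping them, so they are not silently recomputed later.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()                    // materialize the cache

// ... use the cached RDD ...

rdd.unpersist(blocking = true) // release it explicitly, on my own schedule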


Thanks in advance,

Pietro.


NOT IN in Spark SQL

2015-09-03 Thread Pietro Gentile
Hi all,

How can I use the "NOT IN" clause in Spark SQL 1.2?

It keeps giving me syntax errors, but the query is valid SQL.
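
If the 1.2 parser simply does not accept NOT IN subqueries, one rewrite I could
live with is a LEFT OUTER JOIN plus an IS NULL filter (a sketch; the tables
orders/blacklist and their columns are placeholders, both registered with the
same sqlContext). Note it is not identical to NOT IN when NULL keys are involved,
and it can duplicate rows if the right-hand keys are not unique:

// Intended query:
//   SELECT order_id, customer_id FROM orders
//   WHERE customer_id NOT IN (SELECT id FROM blacklist)
val results = sqlContext.sql("""
  SELECT o.order_id, o.customer_id
  FROM orders o
  LEFT OUTER JOIN blacklist b ON o.customer_id = b.id
  WHERE b.id IS NULL
""")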

Thanks in advance,
Best regards,

Pietro.


SPARK REMOTE DEBUG

2015-06-29 Thread Pietro Gentile
Hi all,

What is the best way to remotely debug Spark applications, with breakpoints?
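
One common approach is the standard JVM remote-debug (JDWP) agent, sketched below
with an arbitrary port 5005. Executor-side options can be set programmatically;
driver-side options normally have to go on spark-submit
(--conf spark.driver.extraJavaOptions=...) or into spark-defaults.conf, because
the driver JVM is already running by the time application code builds the SparkConf:

import org.apache.spark.{SparkConf, SparkContext}

// Standard JVM debug agent; suspend=n so executors do not block waiting for a
// debugger. A fixed port collides if a node runs more than one executor.
val jdwp = "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005"

val conf = new SparkConf()
  .setAppName("debuggable-app")
  .set("spark.executor.extraJavaOptions", jdwp)
val sc = new SparkContext(conf)

// Then attach the IDE's remote debugger to <executor-host>:5005 and set breakpoints.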


Thanks in advance,
Best regards!

Pietro


Spark SQL and Streaming Results

2015-06-05 Thread Pietro Gentile
Hi all,


what is the best way to perform Spark SQL queries and obtain the result
tuples in a streaming way? In particular, I want to aggregate data and get
the first, incomplete results quickly, and have them refined until the
aggregation is complete.
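
With Spark Streaming, one way I can see to get early, partial aggregates that
keep being refined is updateStateByKey. A sketch, assuming the input arrives as
comma-separated "key,value" lines on a socket (host, port, batch interval and
checkpoint path are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("incremental-agg")
val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///tmp/agg-checkpoint")      // required by updateStateByKey

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.map(_.split(",")).map(a => (a(0), a(1).toLong))

// Fold each batch's new values into the running total for the key.
def updateTotal(newValues: Seq[Long], state: Option[Long]): Option[Long] =
  Some(state.getOrElse(0L) + newValues.sum)

val runningTotals = pairs.updateStateByKey[Long](updateTotal _)

runningTotals.print()   // every batch prints the partial aggregate so far
ssc.start()
ssc.awaitTermination()

Each batch then emits the running totals seen so far, which converge to the
final aggregate as the rest of the data arrives.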

Best Regards.


Hardware provisioning for Spark SQL

2015-04-29 Thread Pietro Gentile
Hi all,

I have to estimate resource requirements for my Hadoop/Spark cluster. In
particular, I have to query about 100 TB of HBase tables and do aggregations
with Spark SQL.

What is, approximately, the most suitable cluster configuration for my use
case, so that the data can be queried quickly? Eventually I have to develop
an online analytical application on top of these data.

I would like to know what kind of nodes I should configure to achieve this
goal. How much RAM, and how many cores and disks, should these nodes have?

Thanks in advance,
Best regards,

Pietro


[spark-jobserver] Submit Job in yarn-cluster mode (?)

2015-01-14 Thread Pietro Gentile
Hi all,


I'm able to submit Spark jobs through spark-jobserver, but this only allows
using Spark in yarn-client mode. I want to use Spark in yarn-cluster mode as
well, but jobserver does not support it, as stated in the README file at
https://github.com/spark-jobserver/spark-jobserver.

Could you tell me how to use Spark as a service in yarn-cluster mode?

Thanks in advance and Best Regards,


Pietro Gentile


Spark 1.1.0 and HBase: Snappy UnsatisfiedLinkError

2014-11-25 Thread Pietro Gentile
Hi everyone,

I deployed Spark 1.1.0 and I'm trying to use it with spark-job-server 0.4.0
(https://github.com/ooyala/spark-jobserver).
I previously used Spark 1.0.2 and had no problems with it. I want to use the
newer version of Spark (and Spark SQL) to create the SchemaRDD programmatically.

The CLASSPATH variable is set properly, because the following code works
perfectly (adapted from
https://spark.apache.org/docs/1.1.0/sql-programming-guide.html, but with input
from an HBase table).

But when I put this code in the override def runJob(sc: SparkContext,
jobConfig: Config): Any method, it does not work. The exception is:

java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.maxCompressedLength(I)I
    at org.xerial.snappy.SnappyNative.maxCompressedLength(Native Method)
    at org.xerial.snappy.Snappy.maxCompressedLength(Snappy.java:320)
    at org.xerial.snappy.SnappyOutputStream.<init>(SnappyOutputStream.java:79)
    at org.apache.spark.io.SnappyCompressionCodec.compressedOutputStream(CompressionCodec.scala:125)
    at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:83)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:68)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:36)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:809)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:829)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:769)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:772)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$submitStage$4.apply(DAGScheduler.scala:771)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:771)
    at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:753)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1360)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

This exception occurs at the line "val peopleRows = new NewHadoopRDD", when it
tries to read rows from HBase (0.98). I executed this code in both Scala and Java.

Any ideas? What could it depend on?



CODE

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Import Spark SQL data types and Row.
import org.apache.spark.sql._

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.NewHadoopRDD

// The schema is encoded in a string.
val schemaString = "name age"

// Generate the schema based on the string of schema.
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Read the "people" HBase table.
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "people")

val peopleRows = new NewHadoopRDD(sc,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result],
  conf)

// Convert records of the RDD to Rows, extracting the "name" and "age"
// qualifiers (column family "cf" is an assumption; adjust to the real schema).
val rowRDD = peopleRows.map { case (_, result) =>
  Row(
    Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"))),
    Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("age"))))
}

// Apply the schema to the RDD.
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)

// Register the SchemaRDD as a table.
peopleSchemaRDD.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are SchemaRDDs and support all the normal RDD
// operations. The columns of a row in the result can be accessed by ordinal.
results.map(t => "Name: " + t(0)).collect().foreach(println)