Hi,
I'm loading a json file into an RDD and then saving that RDD as Parquet.
One of the fields is a map of keys and values but it is being translated and
stored as a struct.
How can I convert the field into a map?
Thanks,
Daniel
Hello,
I'm writing a process that ingests json files and saves them as parquet
files.
The process is as such:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")
val parquetRequests = sqlContext.parquetFile("/requests_parquet")
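The snippet reads both paths but never shows the write step. A hedged sketch of the full read-then-save flow on the Spark 1.x SchemaRDD API (paths are the thread's examples):

```scala
// Spark 1.x API: infer a schema from the JSON files, write out as Parquet,
// then read the Parquet files back for later queries.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")      // schema inferred from JSON
jsonRequests.saveAsParquetFile("/requests_parquet")      // persist as Parquet
val parquetRequests = sqlContext.parquetFile("/requests_parquet")
```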
Very cool thank you!
On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote:
You can also insert into existing tables via .insertInto(tableName,
overwrite). You just have to import sqlContext._
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello
mapper.registerModule(DefaultScalaModule)
val myMap = mapper.readValue[Map[String,String]](json)
myMap
})
Thanks
Best Regards
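The fragment above appears to come from a map over the raw text. A fuller hedged version of the same idea, creating the (non-serializable, costly) Jackson mapper once per partition rather than per record:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Parse each JSON line into a Scala Map[String, String].
// mapPartitions builds one ObjectMapper per partition instead of per record.
val maps = sc.textFile("/requests").mapPartitions { lines =>
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  lines.map { json =>
    val myMap = mapper.readValue[Map[String, String]](json)
    myMap
  }
}
```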
On Wed, Nov 19, 2014 at 11:01 AM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I'm loading a json file into a RDD and then save
Hello,
I'm loading and saving json files into an existing directory with parquet files
using the insertIntoTable method.
If the method fails for some reason (differences in the schema in my case), the
_metadata file of the parquet dir is automatically deleted, rendering the
existing parquet
You can save the results as a parquet file or as a text file and create a hive
table based on those files
Daniel
On 20 Nov 2014, at 08:01, akshayhazari akshayhaz...@gmail.com wrote:
Sorry about the confusion I created. I've just started learning this week.
Silly me, I was actually
Hi,
I'm loading a bunch of json files and there seem to be problems with
specific files (either schema changes or incomplete files).
I'd like to catch the inconsistent files but I'm not sure how to do it.
This is the exception I get:
14/11/20 00:13:49 INFO cluster.YarnClientClusterScheduler:
and not doing anything with
them
}
any ideas?
Thanks,
Daniel
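One hedged approach to the question above: pre-validate each line with `Try` before handing the RDD to Spark SQL, so truncated or malformed records are dropped instead of failing the job (Spark 1.x `jsonRDD` API assumed):

```scala
import scala.util.Try
import com.fasterxml.jackson.databind.ObjectMapper

// Keep only lines Jackson can parse; malformed/truncated records are
// filtered out (they could also be routed to a side output for inspection).
val raw = sc.textFile("/requests")
val valid = raw.mapPartitions { lines =>
  val mapper = new ObjectMapper()
  lines.filter(line => Try(mapper.readTree(line)).isSuccess)
}
val jsonRequests = sqlContext.jsonRDD(valid)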
On Thu, Nov 20, 2014 at 10:20 AM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I'm loading a bunch of json files and there seem to be problems with
specific files (either schema changes or incomplete files).
I'd like to catch
Hi Guys,
I really need your help with this:
I'm loading a bunch of json files uploaded via webhdfs, some of them have
some inconsistencies (the json ends mid-line for example) and that causes my
whole application to fail.
How can I continue processing valid json records without failing the
will be null.
On Thu, Nov 20, 2014 at 7:34 AM, Daniel Haviv danielru...@gmail.com
wrote:
Hi Guys,
I really need your help with this:
I'm loading a bunch of json files uploaded via webhdfs, some of them have
some inconsistencies (the json ends mid-line for example) and that causes my
whole
There is probably a better way to do it, but I would register both as temp
tables and then join them via SQL.
BR,
Daniel
On 20 Nov 2014, at 23:53, Harihar Nahak hna...@wynyardgroup.com wrote:
I've a similar type of issue; I want to join two different types of RDD into one RDD
file1.txt
Hi,
Every time I start the spark-shell I encounter this message:
14/11/18 00:27:43 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
Any idea how to overcome it?
the short-circuit feature is a big performance boost I don't want to
Hi,
I'm ingesting a lot of small JSON files and convert them to unified parquet
files, but even the unified files are fairly small (~10MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000
Hi,
I have a column in my schemaRDD that is a map but I'm unable to convert it
to a map. I've tried converting it to a Tuple2[String,String]:
val converted = jsonFiles.map(line => {
  line(10).asInstanceOf[Tuple2[String,String]]})
but I get ClassCastException:
14/11/23 11:51:30 WARN
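The cast above fails because, in Spark SQL, a MapType column is materialized inside the Row as a Scala Map rather than a Tuple2. A hedged sketch of the fix (column index taken from the thread; in some 1.x versions the runtime type is `scala.collection.Map`):

```scala
// Cast the MapType column to a Scala Map, not a Tuple2:
val converted = jsonFiles.map { line =>
  line(10).asInstanceOf[scala.collection.Map[String, String]]
}
```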
Hi,
I want to test Kryo serialization but when starting spark-shell I'm hitting
the following error:
java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer
the kryo-2.21.jar is on the classpath so I'm not sure why it's not picking
it up.
Thanks for your help,
Daniel
The problem was I didn't use the correct class name; it should
be org.apache.spark.serializer.KryoSerializer
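With the corrected class name, enabling Kryo looks like the following (a minimal sketch):

```scala
import org.apache.spark.SparkConf

// The serializer class lives under the `serializer` package:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// equivalently on the command line:
//   spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```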
On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I want to test Kryo serialization but when starting spark-shell I'm
hitting the following error
Hi,
I'm selecting columns from a json file, transform some of them and would
like to store the result as a parquet file but I'm failing.
This is what I'm doing:
val jsonFiles = sqlContext.jsonFile("/requests.loading")
jsonFiles.registerTempTable("jRequests")
val clean_jRequests=sqlContext.sql(select
pipeline work, we have been considering
adding something like:
def modifyColumn(columnName, function)
Any comments anyone has on this interface would be appreciated!
Michael
On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi,
I'm selecting columns from
Hi,
I'm trying to start the thrift server but failing:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/tez/dag/api/SessionNotRunning
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353)
at
are rows inside of rows. If you
want to return a struct from a UDF you can do that with a case class.
On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv danielru...@gmail.com
wrote:
Thank you.
How can I address more complex columns like maps and structs?
Thanks again!
Daniel
On 25 Nov 2014
Prepend file:// to the path
Daniel
On 26 Nov 2014, at 20:15, firemonk9 dhiraj.peech...@gmail.com wrote:
When I am running spark locally, RDD saveAsObjectFile writes the file to
local file system (ex : path /data/temp.txt)
and
when I am running spark on YARN cluster, RDD
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI'
on port 4040.
14/12/09
Hi,
I've built spark successfully with maven but when I try to run spark-shell I
get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/SparkSubmit
Caused by:
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I've built spark successfully with maven but when I try to run
spark-shell I get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Exception in thread "main"
Best Regards
On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv danielru...@gmail.com
wrote:
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1
hdfs@ams-rsrv01:/tmp/spark
/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-rdbms-3.2.1.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-api-jdo-3.2.1.jar
Thanks
Best Regards
On Tue, Dec 16, 2014 at 10:33 PM, Daniel Haviv danielru...@gmail.com
wrote:
Same here...
# jar tf lib
it and see what are the contents?
On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote:
I've added every jar in the lib dir to my classpath and still no luck:
CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core
problem..
2014-12-09 22:58 GMT+08:00 Daniel Haviv danielru...@gmail.com:
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is a bug in the current parquet library that throws null pointer exceptions
when there are full row groups that are null.
https://issues.apache.org/jira/browse/SPARK-4258
You can turn it on if you want:
Hi,
We wrote a spark streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that we're not utilizing data locality,
because any worker might open any file without giving precedence to
locally-stored data.
I can't open the files
Hi,
We are designing a solution which pulls file paths from Kafka and for the
current stage just counts the lines in each of these files.
When running the code it fails on:
Exception in thread "main" org.apache.spark.SparkException: Task not
serializable
at
Hi,
Is it possible to store RDDs as custom output formats, for example ORC?
Thanks,
Daniel
by step guide in how to setup and use Spark in
HDInsight.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
Jacob
From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Thursday, June 25, 2015 3:19 PM
To: Silvio Fiorito
Cc: user
Hi,
I'm trying to use spark over Azure's HDInsight but the spark-shell fails
when starting:
java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
library is up
at http://spark-packages.org/package/granturing/spark-power-bi the Event Hubs
library should be up soon.
Thanks,
Silvio
From: Daniel Haviv
Date: Thursday, June 25, 2015 at 1:37 PM
To: user@spark.apache.org
Subject: Using Spark on Azure Blob Storage
Hi,
I'm trying to use
Hi,
I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start
the spark-shell it always starts with a HiveContext.
How can I disable the HiveContext from being initialized automatically ?
Thanks,
Daniel
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
.
On 3 Jul 2015, at 08:33, Daniel Haviv daniel.ha...@veracity-group.com
wrote:
Thanks
I was looking for a less hack-ish way :)
Daniel
On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
With binary I think it might not be possible, although if you can download
/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
which initializes the SQLContext.
Thanks
Best Regards
On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I've downloaded the pre-built binaries
at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass
Hi,
Is it possible to start the Spark SQL thrift server from within a streaming app
so the streamed data can be queried as it goes in?
Thank you.
Daniel
Hi,
I'd like to start a service with each Spark Executor upon initialization and
have the distributed code reference that service locally.
What I'm trying to do is to invoke torch7 computations without reloading
the model for each row, by starting a Lua http handler that will receive http
requests for
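A common pattern for this is a JVM-wide lazy singleton referenced from inside the tasks, so the service starts once per executor JVM rather than per row. A hedged sketch; `LocalTorchClient` and its `score` method are hypothetical placeholders for the actual torch7/HTTP details:

```scala
// Hypothetical thin client for the local Lua/torch7 HTTP handler.
class LocalTorchClient(host: String, port: Int) {
  def score(row: String): String = ??? // issue an HTTP request to the handler
}

// One-per-JVM lazy singleton: the lazy val is initialized on first access
// inside a task, i.e. once per executor JVM, never on the driver.
object TorchService {
  lazy val client: LocalTorchClient = {
    // launching the torch7 HTTP handler process would go here (hypothetical)
    new LocalTorchClient("localhost", 8080)
  }
}

// Given some rdd: RDD[String] of rows to score:
val scored = rdd.mapPartitions { rows =>
  val svc = TorchService.client // first access per executor starts the service
  rows.map(svc.score)
}
```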
) :: (2, "world")
:: Nil).toDF.cache().registerTempTable("t")
HiveThriftServer2.startWithContext(sqlContext)
}
Again, I'm not really clear what your use case is, but it does sound like
the first link above is what you may want.
-Todd
On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv
daniel.ha
Hi,
My data is constructed from a lot of small files which results in a lot of
partitions per RDD.
Is there some way to locally repartition the RDD without shuffling, so that
all of the partitions that reside on a specific node will become X
partitions on the same node?
Thank you.
Daniel
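The closest built-in to what's asked here is `coalesce` with `shuffle = false`: it narrows many partitions into fewer without introducing a shuffle stage, though it prefers rather than guarantees node-local merges. A minimal sketch:

```scala
// coalesce(n, shuffle = false) builds each new partition from a set of
// parent partitions (preferring co-located ones), so no shuffle stage is
// scheduled - unlike repartition(), which always shuffles.
val manySmallPartitions = sc.textFile("/incoming/json")
val fewer = manySmallPartitions.coalesce(48, shuffle = false) // 48 is arbitrary
```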
executors * 10, but I’m still
trying to figure out the
optimal number of partitions per executor.
To get the number of executors,
sc.getConf.getInt(“spark.executor.instances”,-1)
Cheers,
Doug
On Jul 20, 2015, at 5:04 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
My data
to be aware of that as you decide when to call
coalesce.
Thanks,
Silvio
From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition
Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting
I will
Thank you.
> On 27 Oct 2015, at 4:54, Felix Cheung wrote:
>
> Please open a JIRA?
>
>
> Date: Mon, 26 Oct 2015 15:32:42 +0200
> Subject: HiveContext ignores ("skip.header.line.count"="1")
> From: daniel.ha...@veracity-group.com
> To: user@spark.apache.org
Hi,
I have a csv table in Hive which is configured to skip the header row using
TBLPROPERTIES("skip.header.line.count"="1").
When querying from Hive the header row is not included in the data, but
when running the same query via HiveContext I get the header row.
I made sure that HiveContext sees
microreflection,
dsParamLine['unerroreds'] unerroreds,
dsParamLine['corrected'] corrected,
dsParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd')
day_ts,
cmtsid
""")
On Thu,
indow on the Dstream) - and it works as expected.
>
> Check this out for code example:
> https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala
>
> -adrian
>
> From: Daniel Havi
Hi,
As things that run inside foreachRDD run at the driver, does that mean that
if we use SQLContext inside foreachRDD the data is sent back to the driver
and only then the query is executed, or is it executed at the executors?
Thank you.
Daniel
out soon.
>
>
>
> Hao
>
>
>
> *From:* Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
> *Sent:* Friday, October 9, 2015 3:08 AM
> *To:* user
> *Subject:* Re: Insert via HiveContext is slow
>
>
>
> Forgot to mention that my insert is a multi table insert
Hi,
We are inserting streaming data into a hive orc table via a simple insert
statement passed to HiveContext.
When trying to read the files generated using Hive 1.2.1 we are getting NPE:
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at
Hi,
I'm inserting into a partitioned ORC table using an insert sql statement
passed via HiveContext.
The performance I'm getting is pretty bad and I was wondering if there are
ways to speed things up.
Would saving the DF like this
...@hortonworks.com
wrote:
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob
storage jars but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
Hi,
I'm running Spark 1.4 on Azure.
DataFrame's insertInto fails, but saveAsTable works.
It seems like some issue with accessing Azure's blob storage, but that
doesn't explain why one type of write works and the other doesn't.
This is the stack trace:
Caused by:
eDirectStream[AvroKey[GenericRecord],
> NullWritable, AvroKeyInputFormat[GenericRecord]](..)
>
> val avroData = avroStream.map(x => x._1.datum().toString)
>
>
> Thanks
> Best Regards
>
>> On Thu, Sep 3, 2015 at 6:17 PM, Daniel Haviv
>> <daniel.ha...@verac
Hi,
I'm reading messages from Kafka where the value is an avro file.
I would like to parse the contents of the message and work with it as a
DataFrame, like with the spark-avro package, but instead of files, pass it an
RDD.
How can this be achieved?
Thank you.
Daniel
Hi,
Are there any free JDBC drivers for thrift?
The only ones I could find are Simba's, which require a license.
Thanks,
Daniel
l work outside of EMR, but it's worth a try.
>
> I've also used the ODBC driver from Hortonworks
> <http://hortonworks.com/products/releases/hdp-2-2/#add_ons>.
>
> Regards,
> Dan
>
> On Wed, Sep 16, 2015 at 8:34 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> w
Hi,
I have a DStream which is a stream of RDD[String].
How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ?
Thank you.
Daniel
Hi,
I'm trying to use KafkaUtils.createDirectStream to read avro messages from
Kafka but something is off with my type arguments:
val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord],
GenericRecord, NullWritable, AvroInputFormat[GenericRecord]](ssc,
kafkaParams, topicSet)
I'm
I tried but I'm getting the same error (task not serializable)
> On 25 Sep 2015, at 20:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Is the Schema.parse() call expensive ?
>
> Can you call it in the closure ?
>
>> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Ha
Hi,
I'm getting a NotSerializableException even though I'm creating all of my
objects from within the closure:
import org.apache.avro.generic.GenericDatumReader
import java.io.File
import org.apache.avro._
val orig_schema = Schema.parse(new File("/home/wasabi/schema"))
val READER = new
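In the spark-shell, top-level vals like `orig_schema` and `READER` get captured by task closures, and Avro's `Schema` and readers are not serializable. One standard workaround, sketched here under assumptions (an `avroBytes: RDD[Array[Byte]]` of Avro-encoded messages, and the schema file present on every worker at the same path), is to build them inside `mapPartitions` so nothing from the driver is shipped:

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Build the non-serializable Schema and reader *inside* the closure, once
// per partition; the task then serializes nothing from the driver.
val parsed = avroBytes.mapPartitions { records =>
  val schema = new Schema.Parser().parse(new File("/home/wasabi/schema"))
  val reader = new GenericDatumReader[GenericRecord](schema)
  records.map { bytes =>
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).toString
  }
}
```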
tty inefficient on almost any file system
>
> El martes, 22 de septiembre de 2015, Daniel Haviv
> <daniel.ha...@veracity-group.com> escribió:
>> Hi,
>> We are trying to load around 10k avro files (each file holds only one
>> record) using spark-avro but it take
Hi,
We are trying to load around 10k avro files (each file holds only one
record) using spark-avro but it takes over 15 minutes to load.
It seems that most of the work is being done at the driver, where it creates
a broadcast variable for each file.
Any idea why it's behaving that way?
Thank
Hi,
I'm loading 1,000 files using the spark-avro package:
val df = sqlContext.read.avro(*"/incoming/"*)
When I'm performing an action on this df it seems like for each file a
broadcast is being created and is sent to the workers (instead of the
workers reading their data-local files):
scala>
Hi,
I have an application running a set of transformations and finishes
with saveAsTextFile.
Out of 80 tasks all finish pretty fast but one, which just hangs and
outputs these messages to STDERR:
5/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82
spilling in-memory map of 4.0 GB to
Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable
to load AWS credentials from any provider in the chain)
I
parameter to switch broadcast
> method.
>
> Can you describe the issues with torrent broadcast in more detail ?
>
> Which version of Spark are you using ?
>
> Thanks
>
> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>
Hi,
I'm wrapped the following code into a jar:
val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)")))
val agg = test.groupByKey()
agg.collect.foreach(r=>{println(r._1)})
The result of groupByKey is an empty RDD, when I'm trying the same
code using the spark-shell it's
I'm using EC2 instances
Thank you.
Daniel
> On 9 Jun 2016, at 16:49, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> are you using EC2 instances or local cluster behind firewall.
>
>
> Regards,
> Gourav Sengupta
>
>> On
Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.
Thank you.
Daniel
> On 9 Jun 2016, at 01:11, Steve Loughran <ste...@hortonworks.com> wrote:
>
>
>> On 8 Jun 2016, at 16:34, Daniel Haviv <daniel.ha...@veracity-group.com>
>>
Hi,
Our application is failing due to issues with the torrent broadcast, is
there a way to switch to another broadcast method?
Thank you.
Daniel
> Anyway, I couldn't find a root cause only from the stacktraces...
>
> // maropu
>
>
>
>
> On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>> Hi,
>> I've set spark.broadcast.factory to
>> org.ap
>
> On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> How about using `transient` annotations?
>>
>> // maropu
>>
>> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv <
>> daniel.ha...@veracity-group.
Hi,
I want to use an accumulableCollection of type mutable.ArrayBuffer[String]
to return invalid records detected during transformations, but I don't
quite get its type:
val errorAccumulator: Accumulable[ArrayBuffer[String], String] =
sc.accumulableCollection(mutable.ArrayBuffer[String]())
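The two type parameters come from `Accumulable[R, T]`: `R` is the accumulated collection (`ArrayBuffer[String]`) and `T` is the element type added with `+=` (`String`). A hedged usage sketch, assuming some `records: RDD[String]` and a trivial validity check:

```scala
import scala.collection.mutable

// Accumulable[R, T]: R = the accumulated collection, T = the element type.
val errorAccumulator = sc.accumulableCollection(mutable.ArrayBuffer[String]())

// Flag empty lines as invalid via +=, pass everything on transformed.
val cleaned = records.map { rec =>
  if (rec.trim.isEmpty) errorAccumulator += rec
  rec.toUpperCase
}
cleaned.count()   // accumulators only update once an action runs
println(s"invalid records: ${errorAccumulator.value.size}")
```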
Hi,
I'm trying to use the Spark Sink with Flume but it seems I'm missing some
of the dependencies.
I'm running the following code:
./bin/spark-shell --master yarn --jars
Hi,
shouldn't groupByKey be avoided (
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
?
Thank you,
Daniel
On Wed, Mar 30, 2016 at 9:01 AM, Akhil Das
wrote:
> Isn't it what
Hi,
We have a hive table which gets data written to it by two partition keys,
day and hour.
We would like to stream the incoming files, and since fileStream can only
listen on one directory, we start a streaming job on the latest partition
and every hour kill it and start a new one on a newer
lugin to work? I have CDH 5.5.2 installed which
> comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
>
> Thanks,
> Ben
>
>> On Apr 24, 2016, at 1:43 AM, Daniel Haviv <daniel.ha...@veracity-group.com>
>> wrote:
>>
>> Hi,
>>
Hi,
I tried saving DF to HBase using a hive table with hbase storage handler and
hiveContext but it failed due to a bug.
I was able to persist the DF to hbase using Apache Phoenix, which was pretty
simple.
Thank you.
Daniel
> On 21 Apr 2016, at 16:52, Benjamin Kim wrote:
>
Hi,
I'm running a very simple job (textFile->map->groupBy->count) with Spark
1.6.0 on EMR 4.3 (Hadoop 2.7.1) and hitting this exception when running in
yarn-client mode but not in local mode:
16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1,
Hi,
I have a DStream I'd like to collect and broadcast its values.
To do so I've created a mutable HashMap which I'm filling with foreachRDD,
but when I'm checking it, it remains empty. If I use an ArrayBuffer it works
as expected.
This is my code:
val arr =
eption. Also, you can’t update the
> value of a broadcast variable, since it’s immutable.
>
> Thanks,
> Silvio
>
> From: Daniel Haviv <daniel.ha...@veracity-group.com>
> Date: Sunday, May 15, 2016 at 6:23 AM
> To: user <user@spark.apache.org>
> Subject: &qu
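A common pitfall in this situation (the exact cause depends on the truncated code): the body of `foreachRDD` runs on the driver, but closures applied to the RDD itself run on the executors, so a driver-side HashMap filled from `rdd.foreach` never changes on the driver. A hedged sketch, assuming a `dstream: DStream[(String, String)]`, that collects to the driver first:

```scala
import scala.collection.mutable

// rdd.foreach { ... } runs on executors; collect() brings the batch to the
// driver so the driver-local map can actually be updated there.
val state = mutable.HashMap[String, String]()
dstream.foreachRDD { rdd =>
  val batch = rdd.collect()                      // result lands on the driver
  batch.foreach { case (k, v) => state(k) = v }  // now a driver-local update
  // a new broadcast could be created here from state.toMap if needed downstream
}
```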
How come for the first() function it calculates an updated value but
for collect() it doesn't?
On Sun, May 8, 2016 at 4:17 PM, Ted Yu wrote:
> I don't think so.
> RDD is immutable.
>
> > On May 8, 2016, at 2:14 AM, Sisyphuss wrote:
> >
> >
Hi,
How can the DirectOutputCommitter be utilized for writing ORC files?
I tried setting it via:
sc.getConf.set("spark.hadoop.mapred.output.committer.class","com.veracity-group.datastage.directorcwriter")
But I can still see a _temporary directory being used when I save my
dataframe as ORC.
Hi,
I have a streaming application which uses a broadcast variable which I
populate from a database.
I would like every once in a while (or even every batch) to update/replace
the broadcast variable with the latest data from the database.
Only way I found online to do this is this "hackish" way (
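The widely-circulated workaround (not an official API) is to hold the broadcast in a mutable reference on the driver and swap it between batches. A hedged sketch; `loadFromDatabase()` is a hypothetical loader returning the lookup data:

```scala
// Driver-side mutable reference to the current broadcast.
var lookup = ssc.sparkContext.broadcast(loadFromDatabase()) // hypothetical loader

dstream.foreachRDD { rdd =>
  // refresh before processing this batch
  lookup.unpersist(blocking = false)
  lookup = rdd.sparkContext.broadcast(loadFromDatabase())
  val current = lookup                 // capture a stable reference for the closure
  rdd.foreach { rec =>
    val table = current.value          // executor-side access to the fresh data
    // ... process rec against table (application-specific) ...
  }
}
```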
Hi,
I'm running Spark using IntelliJ on Windows and even though I set
HADOOP_CONF_DIR it does not affect the contents of sc.hadoopConfiguration.
Anybody encountered it?
Thanks,
Daniel
Or use mapWithState
Thank you.
Daniel
> On 16 Jul 2016, at 03:02, RK Aduri wrote:
>
> You can probably define sliding windows and set larger batch intervals.
>
>
>
> --
> View this message in context:
>
Hi,
I'm using Spark 2.1.0 on Zeppelin.
I can successfully create a table but when I try to select from it I fail:
spark.sql("create table foo (name string)")
res0: org.apache.spark.sql.DataFrame = []
spark.sql("select * from foo")
org.apache.spark.sql.AnalysisException:
Hive support is required
Hi,
I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
val kafkaStream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=>
{
ue, Nov 8, 2016 at 7:46 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
>
> val kafkaStream = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc
Scratch that, it's working fine.
Thank you.
On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I should have used transform instead of map
>
> val x: DStream[(String, Record)] =
> kafkaStream.transform(x=>{x
Hi,
How can I utilize mapWithState and DataFrames?
Right now I stream json messages from Kafka, update their state, output the
updated state as json and compose a dataframe from it.
It seems inefficient both in terms of processing and storage (a long string
instead of a compact DF).
Is there a
Hi,
I have a stateful streaming app where I pass a rather large initialState
RDD at the beginning.
No matter how many partitions I divide the stateful stream into, I keep
failing on OOM or Java heap space errors.
Is there a way to make it more resilient?
How can I control its storage level?
This is
Hi,
I have a case where I use partitionBy to write my DF using a calculated
column, so it looks somethings like this:
val df = spark.sql("select *, from_unixtime(ts, 'MMddH')
partition_key from mytable")
df.write.partitionBy("partition_key").orc("/partitioned_table")
df is 8 partitions in
Hi,
I have a fairly simple stateful streaming job that suffers from high GC, and
its executors are killed as they are exceeding the size of the requested
container.
My current executor-memory is 10G, spark overhead is 2G and it's running
with one core.
At first the job begins running at a rate
Hi,
Is it possible to use mapWithState without checkpointing at all?
I'd rather have the whole application fail, restart, and reload an
initialState RDD than pay for checkpointing every 10 batches.
Thank you,
Daniel