Hi Phil,
thanks for the Jira, I will try to take a look asap.
Regards
JB
On 10/19/2015 11:07 PM, Phil Kallos wrote:
I am currently trying a few code changes to see if I can squash this
error. I have created https://issues.apache.org/jira/browse/SPARK-11193
to track progress, hope that is
This seems to be set using hive.exec.scratchdir, is that set?
hdfsSessionPath = new Path(hdfsScratchDirURIString, sessionId);
createPath(conf, hdfsSessionPath, scratchDirPermission, false, true);
conf.set(HDFS_SESSION_PATH_KEY, hdfsSessionPath.toUri().toString());
On 20 October 2015
Thread-local state does not work well with PySpark, because the
thread used by PySpark in the JVM can change over time, so the SessionState
could be lost.
This should be fixed in master by https://github.com/apache/spark/pull/8909
On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote:
Kali
>> can I cache an RDD in memory for a whole day? As far as I know the RDD will
get empty once the spark code finishes executing (correct me if I am wrong).
Spark can definitely be used as a replacement for in-memory databases for
certain use cases. Spark RDDs are not shared amongst contexts. You
If you look at the source code you’ll see that this is merely a convenience
function on PairRDDs; the only interesting detail is that it uses a mutable
HashMap to optimize creating maps with many keys. That being said, .collect()
is called anyway.
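To make that concrete, a toy sketch of the two forms (Spark 1.x shell):
// both pull the entire map to the driver; collectAsMap fills a mutable
// HashMap as it iterates, collect().toMap materializes an array first
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val m1 = pairs.collectAsMap()
val m2 = pairs.collect().toMap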
Hi all,
I wonder if there is a way to create child streams while using
spark streaming? For example, I create a netcat main stream, read data from a
socket, then create 3 different child streams on the main stream; in stream1, we
do fun1 on the input data then print the result to screen; in
Hi All,
Is there any performance impact when I use collectAsMap on my RDD instead of
rdd.collect().toMap?
I have a key-value RDD and I want to convert it to a HashMap. As far as I know,
collect() is not efficient on large data sets as it runs on the driver. Can I
use collectAsMap instead? Is there any
One region
-----Original Message-----
From: "Ted Yu"
Sent: 20-10-2015 15:01
To: "Amit Singh Hora"
Cc: "user"
Subject: Re: Spark opening to many connection with zookeeper
How many regions does your table have?
Which HBase
You can create as many functional derivatives of your original stream by
using transformations. That's exactly the model that Spark Streaming offers.
In your example, that would become something like:
val stream = ssc.socketTextStream("localhost", )
val stream1 = stream.map(fun1)
val stream2 =
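The digest cuts the example off; a completed sketch of the same pattern
(fun1/fun2/fun3 are dummy stand-ins for your own logic, and port 9999 is just
a placeholder):
import org.apache.spark.streaming.{Seconds, StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1))
val stream = ssc.socketTextStream("localhost", 9999)
// dummy per-stream logic
val fun1 = (s: String) => s.toUpperCase
val fun2 = (s: String) => s.nonEmpty
val fun3 = (s: String) => s.split(" ").toSeq
// each derived stream is just another transformation of the parent
val stream1 = stream.map(fun1)
val stream2 = stream.filter(fun2)
val stream3 = stream.flatMap(fun3)
stream1.print()  // every derived stream needs its own output operation
stream2.print()
stream3.print()
ssc.start()
ssc.awaitTermination()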
Hello,
when upgrading to Spark 1.5.1 from 1.4.1 the following code crashed at
runtime. It is mainly used to parse HiveQL queries and check that they are
valid.
package org.apache.spark.sql.hive
val sql = "CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo` INT)
PARTITIONED BY (year INT,
How many regions does your table have?
Which HBase release do you use?
Cheers
On Tue, Oct 20, 2015 at 12:32 AM, Amit Singh Hora
wrote:
> Hi All ,
>
> My spark job started reporting zookeeper errors after seeing the zkdumps
> from Hbase master i realized that there are N
Hi Michael,
thank you for your reply, I tried to use explode before, but didn't have
much success. But I didn't find an example where it was used like you
suggested.
I applied your suggestion and it works like a charm!
Thanks,
Nuno
Nuno Carvalho
Software Engineer - Big Data & Analytics
On 19
Hi Phil,
did you see my comments in the Jira?
Can you provide an update in the Jira please? Thanks!
Regards
JB
On 10/19/2015 11:07 PM, Phil Kallos wrote:
I am currently trying a few code changes to see if I can squash this
error. I have created
As Francois pointed out, you are encountering a classic small-file
anti-pattern. One solution I used in the past is to wrap all these small binary
files into a sequence file or an Avro file. For example, the Avro schema can
have two fields: filename: string and binaryname: byte[]. Thus your file is
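A minimal sketch of the packing step (paths are hypothetical; assumes Spark
1.x and files small enough to hold one per record):
// read each small file whole as (path, bytes), then write a few large
// sequence files instead of thousands of tiny ones
val packed = sc.binaryFiles("hdfs:///small_files_dir")
  .map { case (path, stream) => (path, stream.toArray) }  // PortableDataStream -> Array[Byte]
  .coalesce(4)                                            // a handful of big partitions
packed.saveAsSequenceFile("hdfs:///packed_dir")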
coalesce without a shuffle? It shouldn't be an action. It just treats many
partitions as one.
On Tue, Oct 20, 2015 at 1:00 PM, t3l wrote:
>
> I have dataset consisting of 5 binary files (each between 500kb and
> 2MB). They are stored in HDFS on a Hadoop cluster. The
I have a dataset consisting of 5 binary files (each between 500kb and
2MB). They are stored in HDFS on a Hadoop cluster. The datanodes of the
cluster are also the workers for Spark. I open the files as an RDD using
sc.binaryFiles("hdfs:///path_to_directory"). When I run the first action that
Hi All,
When processing streams of data (with batch interval 1 sec), there is a
possible case of a concurrency issue, i.e. two messages M1 and M2 (an updated
version of M1) with the same key are processed by 2 threads in parallel.
To resolve this concurrency issue, I am applying a HashPartitioner on the RDD,
as sketched below.
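A sketch of that partitioning step (the pair RDD and the partition count are
hypothetical):
import org.apache.spark.HashPartitioner
val keyedRdd = sc.parallelize(Seq(("k1", "v1"), ("k1", "v1b"), ("k2", "v2")))
// all records with the same key land in the same partition, so M1 and M2
// are processed sequentially by whichever task owns that partition
val partitioned = keyedRdd.partitionBy(new HashPartitioner(16))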
Hi All
We run an application with version 1.4.1 in standalone mode. We saw two tasks
in one stage which run very slowly and seem to hang. We know that the
JobScheduler has the ability to reassign a straggler task to another node. But
from what we saw it does not reassign. So we want to know is there
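For reference, the reassignment the poster describes is Spark's speculative
execution, which is off by default; a hedged sketch of the relevant settings:
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.speculation", "true")           // re-launch suspiciously slow tasks elsewhere
  .set("spark.speculation.multiplier", "1.5") // how much slower than the median counts as slow
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before checking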
Hi All ,
My spark job started reporting zookeeper errors. After seeing the zkdumps
from the HBase master I realized that there are N number of connections being
made from the nodes where the Spark workers are running. I believe somehow
the connections are not getting closed and that is leading to the error
Hi Deenar,
Thanks for your valuable inputs
Here is a situation: if a source table does not have any column (unique
values, numeric and sequential) which is suitable as the partition column to
be specified for the JDBCRDD constructor or the DataSource API, how do we
proceed further in this scenario, and also
Hi,
I couldn't access the following URL (404):
http://hbase.apache.org/book.html
The above is linked from http://hbase.apache.org
Where can I find the refguide ?
Thanks
Can you take a look at example 37 on page 225 of:
http://hbase.apache.org/apache_hbase_reference_guide.pdf
You can use the following method of Table:
void put(List<Put> puts) throws IOException;
After the put() returns, the connection is closed.
Cheers
On Tue, Oct 20, 2015 at 2:40 AM, Amit Hora
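Putting the advice together, a sketch of the batched-put pattern inside a
Spark job (HBase 1.x client; the table name, column family, and the RDD's
(String, String) shape are assumptions):
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConversions._

val rdd = sc.parallelize(Seq(("row1", "v1"), ("row2", "v2")))
rdd.foreachPartition { records =>
  val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = conn.getTable(TableName.valueOf("mytable"))
  val puts = records.map { case (rowKey, value) =>
    val p = new Put(Bytes.toBytes(rowKey))
    p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
    p
  }.toList
  table.put(puts)  // one batched call per partition, not one per record
  table.close()
  conn.close()     // the connection is closed deterministically
}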
Figured it out.
The fix: change the compression codec and set LD_LIBRARY_PATH.
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/apache/hadoop/lib/native/
-sh-4.1$ ./bin/spark-shell
import org.apache.hadoop.io.Text
import org.codehaus.jackson.map.ObjectMapper
import scala.collection.JavaConversions._
Hi All,
Does anyone have fair scheduling working for them in a hive server? I have
one hive thriftserver running and multiple users trying to run queries at
the same time on that server using a beeline client. I see that a big query
is stopping all other queries from making any progress. Is this
Not the most obvious place in the docs... but this is probably helpful:
https://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling
You likely want to put each user in their own pool.
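From that page, a JDBC/beeline session picks its pool with a SET command (the
pool name here is only an example):
SET spark.sql.thriftserver.scheduler.pool=accounting;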
On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood wrote:
> Hi All,
>
> Does
Dear all
I want to set up spark in cluster mode. The problem is that each worker node
is looking for a file to process in its local directory. Is it
possible to set up something like HDFS so that each worker node takes its part
of a file from HDFS? Any good tutorials for this?
Thanks
I am new to Spark, and this user community, so my apologies if this was
answered elsewhere and I missed it (I did try search first).
We have multiple large RDDs stored in HDFS via Spark (by calling
pairRDD.saveAsNewAPIHadoopFile()), and one thing we need to do is re-load a
given RDD (by
I am having issues trying to set up spark to run jobs simultaneously.
I thought I wanted FAIR scheduling?
I used the templated fairscheduler.xml as-is. When I start pyspark I see the
3 expected pools:
production, test, and default
when I login as second user and run pyspark
I see the expected
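For completeness, a sketch of the pieces FAIR scheduling needs (paths and pool
names are examples):
import org.apache.spark.SparkConf
val conf = new SparkConf()
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
// then, per thread or user session, pick a pool before submitting jobs:
sc.setLocalProperty("spark.scheduler.pool", "production")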
For compatibility reasons, we always write data out as nullable in
parquet. Given that that bit is only an optimization that we don't
actually make much use of, I'm curious why you are worried that it's
changing to true?
On Tue, Oct 20, 2015 at 8:24 AM, Jerry Lam wrote:
>
My Code:
val dwsite = sc.sequenceFile("/sys/edw/dw_sites/snapshot/2015/10/18/00/part-r-0",
  classOf[Text], classOf[Text])
val records = dwsite.filter {
  case (k, v) => v.toString.indexOf("Bhutan") != -1
}
Hi All,
I am new to Spark, and I am trying to understand how preemption works with
Spark on Yarn. My goal is to determine the amount of re-work a Spark
application has to do if an executor is preempted.
For my test, I am using a 4 node cluster with Cloudera VM running Spark
1.3.0. I am running
Are you using hiveContext?
First, build your Spark using the following command:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver
-DskipTests clean package
Then, try this sample program
object SimpleApp {
case class Individual(name: String, surname: String, birthDate:
Just curious why you are using parseSql APIs?
It works well if you use the external APIs. For example, in your case:
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("CREATE EXTERNAL TABLE IF NOT EXISTS `t`(`id` STRING, `foo`
INT) PARTITIONED BY (year INT, month
Yes.
We are trying to run a custom script written in C# using TRANSFORM, but cannot
get it to work.
The query and error are below. Any suggestions? Thank you!
Spark version: 1.3
Here is how we add and invoke the script:
scala> hiveContext.sql("""ADD FILE wasb://… /NSSGraphHelper.exe""")
Yeah, I don't think this feature was designed to work on systems that don't
have bash. You could open a JIRA.
On Tue, Oct 20, 2015 at 10:36 AM, Yang Wu (Tata Consultancy Services) <
v-wuy...@microsoft.com> wrote:
> Yes.
>
> We are trying to run a custom script written in C# using TRANSFORM, but
btw, what version of Spark did you use?
On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote:
> I've connected Spark SQL to the Hive Metastore and currently I'm running
> SQL
> code via pyspark. Typically everything works fine, but sometimes after a
> long-running Spark SQL job I
Let me share my 2 cents.
First, this is not documented in the official documentation. Maybe we should
do it? http://spark.apache.org/docs/latest/sql-programming-guide.html
Second, nullability is a significant concept for database people. It is
part of the schema. Extra code is needed for evaluating
I could be misreading the code, but looking at the code for toLocalIterator
(copied below), it should lazily call runJob on each partition in your input.
It shouldn't be parsing the entire RDD before returning from the first "next"
call. If it is taking a long time on the first "next" call, it
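(The code the author pasted was dropped by the digest; the Spark 1.x
implementation looks roughly like this, paraphrased from memory rather than
verbatim:)
def toLocalIterator: Iterator[T] = {
  def collectPartition(p: Int): Array[T] =
    sc.runJob(this, (iter: Iterator[T]) => iter.toArray, Seq(p)).head
  // one runJob per partition, requested lazily as the iterator advances
  (0 until partitions.length).iterator.flatMap(i => collectPartition(i))
}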
Hi Spark users and developers,
I have a dataframe with the following schema (Spark 1.5.1):
StructType(StructField(type,StringType,true),
StructField(timestamp,LongType,false))
After I save the dataframe in parquet and read it back, I get the following
schema:
// sort by 2nd element
Sorting.quickSort(pairs)(Ordering.by[(String, Int, Int), Int](_._2))
// sort by the 3rd element, then 1st
Sorting.quickSort(pairs)(Ordering[(Int, String)].on(x => (x._3, x._1)))
On Tue, Oct 20, 2015 at 11:33 AM, Carol McDonald
wrote:
> this works
Nabble is not related to the mailing list, so I think that's the problem.
spark.apache.org -> Community -> Mailing Lists:
http://spark.apache.org/community.html
These instructions from the project website itself are correct.
On Tue, Oct 20, 2015 at 5:22 PM, Jeff Sadowski
this works
val top10 = logs.filter(log => log.responseCode != 200).map(log =>
(log.endpoint, 1)).reduceByKey(_ + _).top(10)(Ordering.by(_._2))
On Tue, Oct 20,
I need to dig deeper into saveAsHadoopDataset to see what might have caused
the effect you observed.
Cheers
On Tue, Oct 20, 2015 at 8:57 AM, Amit Hora wrote:
> Hi Ted,
>
> I made mistake last time yes the connection are very controlled when I
> used put like iterated over
Do you want to compare within the RDD or do you have some external list or
data coming in?
For matching, you could look at string edit distances or cosine similarity
if you are only comparing title strings.
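A sketch of the edit-distance route within one RDD (commons-lang3 is assumed
to be on the classpath; the titles and query are examples):
import org.apache.commons.lang3.StringUtils
val titles = sc.parallelize(Seq(
  "About Spark Streaming", "About Spark MLlib", "Kafka Setup"))
val query = "About Spark"
// smaller Levenshtein distance = more similar title
val ranked = titles.map(t => (t, StringUtils.getLevenshteinDistance(t, query)))
  .sortBy(_._2)
ranked.take(3).foreach(println)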
On Oct 20, 2015 9:09 PM, "Ascot Moss" wrote:
> Hi,
>
> I have my
Hi Phil,
as you can see in the Jira, I tried with both Spark 1.5.1 and
1.6.0-SNAPSHOT, and I'm not able to reproduce your issue.
What's your Java version (and I guess you use Scala 2.10 for
compilation, not -Pscala-2.11)?
Regards
JB
On 10/20/2015 03:59 PM, Jean-Baptiste Onofré wrote:
Hi
Hi Ted,
I made a mistake last time; yes, the connections are very controlled when I
used put: I iterated over the RDD and, within each partition, made a
connection and executed a put list for HBase.
But why was it that the connections were getting too many when I used hibconf
and
We support TRANSFORM. Are you having a problem using it?
On Tue, Oct 20, 2015 at 8:21 AM, wuyangjack wrote:
> How to reuse hive custom transform scripts written in python or c++?
>
> These scripts process data from stdin and print to stdout in spark.
> They use the
How can I reuse Hive custom transform scripts written in Python or C++?
These scripts process data from stdin and print to stdout in spark.
They use the Transform syntax in Hive:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform
Example in Hive:
SELECT TRANSFORM(stuff)
Hi,
I have my RDD that stores the titles of some articles:
1. "About Spark Streaming"
2. "About Spark MLlib"
3. "About Spark SQL"
4. "About Spark Installation"
5. "Kafka Streaming"
6. "Kafka Setup"
7.
I need to build a model to find titles by similarity,
e.g. if given "About Spark", hope to
Hi Jeff,
Hard to say what's going on. I have had problems subscribing to the Apache
lists in the past. My problems, which may be different than yours, were
caused by replying to the confirmation request from a different email
account than the account I was trying to subscribe from. It was easy
That will depend on what your transformation is; your code snippet might
help.
On Tue, Oct 20, 2015 at 1:53 AM, shahid ashraf wrote:
> Hi
>
> Any idea why is 50 GB shuffle read and write for 3.3 gb data
>
> On Mon, Oct 19, 2015 at 11:58 PM, Kartik Mathur
That's not really intended to be a public API, as there is some internal
setup that needs to be done for Hive to work. Have you created a
HiveContext in the same thread? Is there more to that stacktrace?
On Tue, Oct 20, 2015 at 2:25 AM, Ayoub wrote:
> Hello,
>
>
Hi TD,
Yes, saveToCassandra throws an exception. How do I fail that task explicitly
if I catch any exceptions?
Right now that batch doesn't fail and remains in a hung state. Is there any
way I can fail that batch so that it can be tried again?
Thanks
Varun
On Tue, Oct 20, 2015 at 2:50 AM, Tathagata Das
also check out wholeTextFiles
https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int)
On 20 October 2015 at 15:04, Lan Jiang wrote:
> As Francois pointed out, you are encountering a classic small file
>
If I have a cluster with 7 nodes, each having an equal number of cores, and
create an RDD with sc.parallelize(), it looks as if Spark will always
try to distribute the partitions.
Question:
(1) Is that something I can rely on?
(2) Can I rely that sc.parallelize() will assign partitions to as
Hi Jeff,
did you try to send an e-mail to subscribe-u...@spark.apache.org ?
Then you will receive a confirmation e-mail, you just have to reply
(nothing special to put in the subject, the important thing is the
reply-to unique address).
Regards
JB
On 10/20/2015 05:48 PM,
I believe it will be most efficient to let top(n) do the work, rather than
sort the whole RDD and then take the first n. The reason is that top and
takeOrdered know they need at most n elements from each partition, and then
just need to merge those. It's never required to sort the whole thing.
I
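To make the comparison concrete, a sketch of the two forms on an assumed
counts: RDD[(String, Int)] (same result, different cost):
val counts = sc.parallelize(Seq(("a", 3), ("b", 9), ("c", 1)))
// per-partition bounded priority queue of size 10, then a small merge on the driver
val byTop = counts.top(10)(Ordering.by(_._2))
// full shuffle sort of every element, then take 10
val bySort = counts.sortBy(_._2, ascending = false).take(10)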
I am having issues subscribing to the user@spark.apache.org mailing list.
I would like to be added to the mailing list so I can post some
configuration questions I have to the list that I do not see asked on the
list.
When I tried adding myself I got an email titled "confirm subscribe to
Please share if you come across any hints
-----Original Message-----
From: "Ted Yu"
Sent: 20-10-2015 21:30
To: "Amit Hora"
Cc: "user"
Subject: Re: Spark opening to many connection with zookeeper
I need to dig deeper into
Is it supposed to be kind of hidden how to subscribe to the mailing list?
When I go to http://apache-spark-user-list.1001560.n3.nabble.com/ (the link
that comes up in a Google search for the spark mailing list)
I was able to register on nabble.com with no issues. Then when I tried
posting it
You have 2 options
a) don't use partitioning, if the table is small spark will only use one
task to load it
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      "dbtable" -> "schema.tablename")).load()
b) create a view that includes hashcode column
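The hashcode route, sketched for Postgres (hashtext is Postgres-specific, and
id/part_key are assumed names; a view works the same way as this inline
subquery):
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:postgresql:dbserver",
      // derive a numeric, evenly spread partition key from the primary key
      "dbtable" -> "(select t.*, abs(hashtext(id)) % 16 as part_key from schema.tablename t) as tmp",
      "partitionColumn" -> "part_key",
      "lowerBound" -> "0",
      "upperBound" -> "16",
      "numPartitions" -> "16")).load()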
>
> First, this is not documented in the official document. Maybe we should do
> it? http://spark.apache.org/docs/latest/sql-programming-guide.html
>
Pull requests welcome.
> Second, nullability is a significant concept in the database people. It is
> part of schema. Extra codes are needed for
On my Mac:
$ ls -l ~/.m2/repository/org/antlr/antlr/3.2/antlr-3.2.jar
-rw-r--r-- 1 tyu staff 895124 Dec 17 2013
/Users/tyu/.m2/repository/org/antlr/antlr/3.2/antlr-3.2.jar
Looks like there might be a network issue on your computer.
Can you check?
Thanks
On Tue, Oct 20, 2015 at 1:21 PM,
Hello all community members,
I need the opinion of people who have used Spark before and can share their
experience, to help me select a technical approach.
I have a project in the Proof of Concept phase, where we are evaluating the
possibility of using Spark for our use case.
Here is a brief task description.
Sure. Will try to do a pull request this week.
Schema evolution is always painful for database people. IMO, NULL was a bad
design in the original System R. It introduces a lot of problems during
system migration and data integration.
Let me give a possible scenario: an RDBMS is used as an ODS.
That is actually a bug in the UI that got fixed in 1.5.1. The batch is
actually completing with an exception; the UI just does not update correctly.
wrote:
> Also, As you can see the timestamps in attached image. batches coming
>
I think it should use the default parallelism, which by default is equal to
the number of cores in your cluster.
If you want to control it, specify a value for numSlices, the second param to
parallelize(); see the sketch below.
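A one-line sketch of that knob:
// pin the partition count explicitly instead of relying on the default
val rdd = sc.parallelize(1 to 100000, numSlices = 28)  // e.g. 7 nodes x 4 cores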
-adrian
On 10/20/15, 6:13 PM, "t3l" wrote:
>If I have a
> On Oct 20, 2015, at 5:31 PM, ravi.gawai wrote:
>
> you can use a map function..
>
> This is a Java example..
>
> final JavaRDD<Product> rdd1 = sc.textFile("filepath").map((line) -> {
>     // logic for line-to-Product conversion });
>
> Product class might have 5 attributes like you
Hi all,
I am really a newbie in Spark and Scala.
I cannot get the statistics result from an RDD. Could someone help me with
this?
Current code is as follows:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val sqlContext = new
As an academic aside, note that all datatypes are nullable according to the
SQL Standard. NOT NULL is modelled in the Standard as a constraint on data
values, not as a parallel universe of special data types. However, very few
databases implement NOT NULL via integrity constraints. Instead,
you can use a map function..
This is a Java example..
final JavaRDD<Product> rdd1 = sc.textFile("filepath").map((line) -> {
    // logic for line-to-Product conversion
});
Product class might have 5 attributes like you said: class Product {
    String str1;
    int i1;
    String str2;
    int i2;
    String str3;
    // with getter
I think the data file is binary per the original post. So in this case,
sc.binaryFiles should be used. However, I still recommend against using so
many small binary files, as:
1. They are not good for batch I/O.
2. They put too much memory pressure on the namenode.
Lan
> On Oct 20, 2015, at 11:20
Thanks, but I still don’t get it.
I have used groupBy to group data by userID, and for each ID, I need to get the
statistic information.
Best
Frank
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, October 20, 2015 3:12 PM
To: ChengBo
Cc: user
Subject: Re: Get statistic result from RDD
Thanks Michael, I'll try it out. Another quick/important question: how do I
make UDFs available to all of the hive thriftserver users? Right now, when
I launch a spark-sql client, I notice that it reads the ~/.hiverc file and
all UDFs get picked up, but this doesn't seem to be working in hive
Hi SF based folks,
I'm going to try doing some simple office hours this Friday afternoon
outside of Paramo Coffee. If no one comes by I'll just be drinking coffee
hacking on some Spark PRs so if you just want to hangout and hack on Spark
as a group come by too. (See
Pete:
Please don't mix unrelated email on the back of another thread.
To unsubscribe, see first section of https://spark.apache.org/community
On Tue, Oct 20, 2015 at 2:42 PM, Pete Zybrick wrote:
>
>
>
1.3 on cdh 5.4.4 ... I'll take the responses to mean that the fix will
probably be a few months away for us. Not a huge problem, but something I've
run into a number of times.
On Tue, Oct 20, 2015 at 3:01 PM, Yin Huai wrote:
> btw, what version of Spark did you use?
>
> On
Please take a look at:
examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala
Cheers
On Tue, Oct 20, 2015 at 3:18 PM, ChengBo wrote:
> Thanks, but I still don’t get it.
>
> I have used groupBy to group data by userID, and for each ID, I need to
>
I tried, but it shows:
“error: value reduceByKey is not a member of iterable[((Int, Int, String,
String), String), Int]”
Best
Frank
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, October 20, 2015 3:46 PM
To: ChengBo
Cc: user
Subject: Re: Get statistic result from RDD
Please take a
Your mapValues can emit a tuple. If p(0) is between 0 and 5, the first
component of the tuple would be 1, the second 0.
If p(0) is 6 or 7, the first component would be 0, the second 1.
You can use reduceByKey to sum up the corresponding components, as sketched below.
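A sketch of that suggestion, assuming an RDD of (userId, p) pairs where p is
an indexed record:
val rdd = sc.parallelize(Seq(("user1", Array(3)), ("user1", Array(7)), ("user2", Array(5))))
val result = rdd
  .mapValues(p => if (p(0) >= 0 && p(0) <= 5) (1, 0) else (0, 1))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2))  // per-key sums of both counters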
On Tue, Oct 20, 2015 at 1:33 PM, Shepherd
To find the top 10 counts, which is better: using top(10) with an Ordering on
the value,
or swapping the key and value and ordering on the key? For example, which is
better below?
Or does it matter?
val top10 = logs.filter(log => log.responseCode != 200).map(log =>
(log.endpoint, 1)).reduceByKey(_ +
I used that also but the number of connections kept increasing; it started
from 10 and went up to 299.
Then I changed my zookeeper conf to set max client connections to just 30 and
restarted the job.
Now the connections have been between 18 and 24 for the last 2 hours.
I am unable to understand such behaviour
I'm using Spark on Windows 10 (64 bit) and jdk8u60
When I start spark-shell, I get many warnings and exceptions. The
system complains about already-registered datanucleus plugins, emits
exceptions and so on:
C:\test>spark-shell
log4j:WARN No appenders could be found for logger
Hi,
Currently I have a job that spills to disk and memory due to usage of
reduceByKey and a lot of intermediate data in reduceByKey that gets
shuffled.
How can I use a custom partitioner in Spark Streaming for an intermediate stage
so that the next stage that uses reduceByKey does not have to
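A hedged sketch of one way to do this (the DStream[String] named stream and
the key/merge logic are assumptions): pre-partition the pair DStream, then
hand the same partitioner to reduceByKey so it does not shuffle again.
import org.apache.spark.HashPartitioner
val partitioner = new HashPartitioner(32)
val keyed = stream.map(line => (line.split(",")(0), line))  // key = first field
// co-locate keys once...
val partitioned = keyed.transform(_.partitionBy(partitioner))
// ...and reuse the same partitioner, so reduceByKey finds data already in place
val reduced = partitioned.reduceByKey((a, b) => b, partitioner)  // "keep latest"; your merge here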
Hello,
When joining 2 DataFrames which originate from the same initial DataFrame,
why can't org.apache.spark.sql.DataFrame.apply(colName: String) method
distinguish which column to read?
Let me illustrate this question with a simple example (ran on Spark 1.5.1):
//my initial DataFrame
scala> df
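(A common workaround, sketched with made-up column names, where df is the
initial DataFrame from the question: alias each side and qualify columns
through the alias.)
import org.apache.spark.sql.functions.col
// l and r both derive from the same original df
val l = df.as("l")
val r = df.as("r")
l.join(r, col("l.id") === col("r.id")).select(col("l.value"))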
Yes, I don't think there is. You can use SparkContext.parallelize() to
make an RDD from a list. But no location preferences are supported yet.
On Sat, Oct 17, 2015 at 8:42 AM, Philip Weaver
wrote:
> I believe what I want is the exact functionality provided by
>
Hi All,
I am using HBase 1.1.1. I came across a post describing hbase-spark being
included in HBase core.
I am trying to use HBaseContext but can't find the appropriate lib; while
trying to add the following to my pom I am getting a missing-artifact error:
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>1.1.1</version>
</dependency>
-----Original Message-----
Thanks, we decided to try to add the support ourselves :).
On Tue, Oct 20, 2015 at 6:40 PM, Jeff Zhang wrote:
> Yes, I don't think there is. You can use SparkContext.parallelize() to
> make a RDD from a list. But no location preferences supported yet.
>
> On Sat, Oct 17, 2015
Hi Everyone,
Does anyone have any idea if spark-1.5.1 is available as a service on
HortonWorks? I have spark-1.3.1 installed on the Cluster and it is a
HortonWorks distribution. Now I want to upgrade it to spark-1.5.1. Does anyone
here have any idea about it? Thank you in advance.
Regards,
Ajay
BTW, I think the JSON parser should verify the JSON format, at least when
inferring the schema of the JSON.
On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang wrote:
> I think this is due to the json file format. DataFrame can only accept
> json file with one valid record per line.
Doug,
is it possible to put it in HDP 2.3, especially in the Sandbox?
Can you share how you installed it?
F
--
Frans Thamura (曽志胜)
Java Champion
Shadow Master and Lead Investor
Meruvian.
Integrated Hypermedia Java Solution Provider.
Mobile: +628557888699
Blog: http://blogs.mervpolis.com/roller/flatburger (id)
Hi Cody,
What other options do I have other than monitoring and restarting the job?
Can the job recover automatically?
Thanks,
Sweth
On Thu, Oct 1, 2015 at 7:18 AM, Cody Koeninger wrote:
> Did you check you kafka broker logs to see what was going on during that
> time?
>
>
I think this is due to the JSON file format. DataFrame can only accept a
JSON file with one valid record per line. Multiple lines per record are
invalid for DataFrame, as illustrated below.
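That is, each line must be a self-contained JSON document; a sketch:
// valid for sqlContext.read.json -- one record per line:
//   {"name": "a", "value": 1}
//   {"name": "b", "value": 2}
// invalid -- a single record split across lines:
//   {"name": "a",
//    "value": 1}
val df = sqlContext.read.json("hdfs:///path/to/records.json")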
On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu wrote:
> Could you create a JIRA to track this bug?
>
> On
Hi all,
I am trying to calculate a histogram of all columns from a CSV file using
Spark Scala.
I found that DoubleRDDFunctions supports histogram.
So I coded the following for getting the histogram of all columns:
1. Get the column count
2. Create an RDD[Double] of each column and calculate the histogram of
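(The digest cuts the message off; a sketch of step 2 for one column, with the
file path and column index as placeholders:)
// parse column 0 as Double, then compute a 10-bucket equal-width histogram
val col0 = sc.textFile("hdfs:///data.csv").map(_.split(",")(0).toDouble)
val (buckets, bucketCounts) = col0.histogram(10)  // bucket boundaries, counts per bucket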
You are correct, that was the issue.
On Tue, Oct 20, 2015 at 10:18 PM, Jeff Zhang wrote:
> BTW, I think Json Parser should verify the json format at least when
> inferring the schema of json.
>
> On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang wrote:
>
>> I
I have been running 1.5.1 with Hive in secure mode on HDP 2.2.4 without any
problems.
Doug
> On Oct 21, 2015, at 12:05 AM, Ajay Chander wrote:
>
> Hi Everyone,
>
> Any one has any idea if spark-1.5.1 is available as a service on HortonWorks
> ? I have spark-1.3.1
Hi Frans,
You could download the Spark 1.5.1-hadoop 2.6 pre-built tarball and copy it
into the HDP 2.3 sandbox or master node. Then copy all the conf files from
/usr/hdp/current/spark-client/ to your Spark conf directory, or you could
refer to this tech preview (
You should aggregate your files into larger chunks before doing anything
else. HDFS is not fit for small files; they will bloat it and cause you a
lot of performance issues. Target partition sizes of a few hundred MB,
then save those files back to HDFS and delete the original
ones. You can