Thank you.
But I'm getting the same warnings and it's still preventing the archive from
being generated.
I've run this on both OS X Lion and Ubuntu 12. Same error. No .gz file
On Mon, Oct 26, 2015 at 9:10 PM, Ted Yu wrote:
> Looks like '-Pyarn' was missing in your command.
>
>
Until Spark Streaming supports dynamic allocation, you could use a
StreamingListener to monitor batch execution times and, based on those, call
sparkContext.requestExecutors() and sparkContext.killExecutors() to add and
remove executors explicitly.
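A minimal sketch of the idea (the listener name and scaling threshold are made up; requestExecutors/killExecutors are developer APIs):

import org.apache.spark.SparkContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Hypothetical listener: requests an extra executor whenever a batch takes
// longer than the batch interval to process.
class ScalingListener(sc: SparkContext, batchIntervalMs: Long) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val processingMs = batch.batchInfo.processingDelay.getOrElse(0L)
    if (processingMs > batchIntervalMs) {
      sc.requestExecutors(1)  // falling behind: ask the cluster manager for one more
    }
    // a symmetric branch could call sc.killExecutors(...) when well under budget
  }
}

// Register it on the StreamingContext:
// ssc.addStreamingListener(new ScalingListener(sc, 10000L))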
On 26 October 2015 at 21:37, Ted Yu
Hi Meihua,
For categorical features, the ordinal issue can be solved by trying all
kinds of different partitions of the q values into two groups, 2^(q-1) - 1
in total (e.g. q = 4 gives 7 candidate splits). However, it's
computationally expensive. In Hastie's book, in section 9.2.4, the trees can
be trained by sorting the residuals and being learnt as if they
1) If you are using the Thrift server, any cached tables would be cached for
all sessions (I am not sure if this was your question).
2) If you want to ensure that the smaller table in the join is replicated
to all nodes, you can do the following:
left.join(broadcast(right), "joinKey")
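For reference, a slightly fuller sketch of the broadcast join (table names are made up; broadcast comes from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.broadcast

// Assumed example tables: `right` is small enough to replicate, so the join
// becomes a map-side join and avoids shuffling the large side.
val left  = sqlContext.table("transactions")  // hypothetical large table
val right = sqlContext.table("lookup")        // hypothetical small table
val joined = left.join(broadcast(right), "joinKey")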
look at this
Hi,
I write my results to HDFS and it works well:
val model = lines.map(pairFunction)
  .groupByKey()
  .flatMap(pairFlatMapFunction)
  .aggregateByKey(new TrainFeature())(seqOp, combOp)
  .values
model.map(a => a.toKey() + "\t" + a.totalCount + "\t" + a.positiveCount)
  .saveAsTextFile(modelDataPath)
But
Also, I just remembered Cloudera's contribution:
http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
From: Deng Ching-Mallete
Date: Tuesday, October 27, 2015 at 12:03 PM
To: avivb
Cc: user
Subject: Re: There is any way to write from spark to HBase
Any ideas? This is so important because we use Kafka direct streaming and
save processed offsets manually as the last step in the job, so we achieve
at-least-once semantics.
But see what happens when a new batch is scheduled after a job fails:
- suppose we start from offset 10 loaded from zookeeper
- job starts
This is probably too low level, but you could consider the async client inside
foreachRDD:
https://github.com/OpenTSDB/asynchbase
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
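A rough sketch of how that could look (hedged: the asynchbase calls below are from memory, check the project README; the table/column names and the stream type are made up):

import org.hbase.async.{HBaseClient, PutRequest}

// One async client per partition; puts are issued without blocking per row.
stream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val client = new HBaseClient("zk-host:2181")  // hypothetical ZK quorum
    records.foreach { case (key, value) =>
      client.put(new PutRequest("myTable".getBytes, key.getBytes,
        "cf".getBytes, "col".getBytes, value.getBytes))
    }
    client.shutdown().join()  // shutdown flushes outstanding RPCs first
  }
}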
On 10/27/15, 11:36 AM, "avivb" wrote:
Hi,
It would be more efficient if you configure the table and flush the commits
by partition instead of per element in the RDD. The latter works fine
because you only have 4 elements, but it won't bode well for large data sets
IMO.
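A sketch of what that looks like with the 0.98-era client API (table and column names are made up):

rdd.foreachPartition { rows =>
  val conf = org.apache.hadoop.hbase.HBaseConfiguration.create()
  val table = new org.apache.hadoop.hbase.client.HTable(conf, "myTable")
  table.setAutoFlush(false, false)  // buffer puts client-side
  rows.foreach { case (key, value) =>
    val put = new org.apache.hadoop.hbase.client.Put(key.getBytes)
    put.add("cf".getBytes, "col".getBytes, value.getBytes)
    table.put(put)
  }
  table.flushCommits()  // one flush per partition instead of per element
  table.close()
}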
Thanks,
Deng
On Tue, Oct 27, 2015 at 5:22 PM, jinhong lu
It's still in HBase's trunk, scheduled for the 2.0.0 release based on the
JIRA ticket.
-Deng
On Tue, Oct 27, 2015 at 6:35 PM, Fengdong Yu
wrote:
> Was this released with Spark 1.x, or is it still kept in trunk?
>
> On Oct 27, 2015, at 6:22 PM, Adrian Tanase
You can get a feel for it by playing with the original library, published as
a separate project on GitHub:
https://github.com/cloudera-labs/SparkOnHBase
From: Deng Ching-Mallete
Date: Tuesday, October 27, 2015 at 12:39 PM
To: Fengdong Yu
Cc: Adrian Tanase, avivb, user
Subject: Re: There is any way
Hi
I notice that you configured the following:
configuration.set("hbase.master", "192.168.1:6");
Did you mistype the host IP?
Best,
Sun.
fightf...@163.com
From: jinhong lu
Sent: 2015-10-27 17:22
To: spark users
Subject: spark to hbase
Hi,
I write my results to HDFS and it works well:
val
Was this released with Spark 1.x, or is it still kept in trunk?
> On Oct 27, 2015, at 6:22 PM, Adrian Tanase wrote:
>
> Also I just remembered about cloudera’s contribution
> http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
>
Hi,
I have grouped all my customers in a JavaPairRDD<Long, List<ProductBean>>
by their customerId (of Long type). This means every customerId has a List
of ProductBean.
Now I want to save all ProductBeans to the DB irrespective of customerId. I
got all the values by using the method
JavaRDD<List<ProductBean>> values =
I have already tried it with https://github.com/unicredit/hbase-rdd and
https://github.com/nerdammer/spark-hbase-connector and in both cases I get a
timeout.
So I would like to know about other options for writing from Spark to HBase
on CDH4.
Thanks!
Dear All,
I will be writing a small project in Spark, and run speed is a big concern.
I have a question: since an RDD is always big on the cluster, is it proper to
pass an RDD variable as a parameter in a function call?
Thank you,
Zhiliang
Hi Nicholas,
I think you are right about the issue relating to SPARK-11126; I'm seeing
it as well.
Did you find any workaround? Looking at the pull request for the fix, it
doesn't look possible.
Best regards,
Patrick
On 15 October 2015 at 19:40, Nicholas Pritchard <
Hi all ,
I am using Cloudera's SparkOnHBase to bulk insert into HBase. Please find
the code below:
object test {
  def main(args: Array[String]): Unit = {
    val conf = ConfigFactory.load("connection.conf").getConfig("connection")
    val
Jinghong:
The hadmin variable is not used. You can omit that line.
Which HBase release are you using?
As Deng said, don't flush per row.
Cheers
> On Oct 27, 2015, at 3:21 AM, Deng Ching-Mallete wrote:
>
> Hi,
>
> It would be more efficient if you configure the table and
Does this help?
https://issues.apache.org/jira/browse/SPARK-5206
On 10/27/15, 1:53 PM, "Amit Singh Hora" wrote:
>Hi all,
>
>I am using Cloudera's SparkOnHBase to bulk insert into HBase. Please find
>the code below:
>object test {
>
>def main(args: Array[String]): Unit =
Can you try the same command shown in the pull request?
Thanks
> On Oct 27, 2015, at 12:40 AM, Kayode Odeyemi wrote:
>
> Thank you.
>
> But I'm getting the same warnings and it's still preventing the archive from
> being generated.
>
> I've run this on both OS X Lion and
The operator you’re looking for is .flatMap. It flattens all the results if you
have nested lists of results (e.g. a map over a source element can return zero
or more target elements).
I’m not very familiar with the Java APIs, but in Scala it would go like this
(keeping type annotations only as
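The snippet is cut off above; a minimal sketch of the idea in Scala (Customer and ProductBean are the hypothetical types from the question):

// Each customer carries zero or more products; flatMap flattens the nesting.
val products: RDD[ProductBean] =
  customers.flatMap { customer: Customer =>
    customer.products  // a List[ProductBean] per customer
  }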
I issued the same basic command and it worked fine.
RADTech-MBP:spark $ ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests
Which created: spark-1.6.0-SNAPSHOT-bin-hadoop-2.6.tgz in the root
directory of the project.
You'll need to convert it to a java.sql.Timestamp.
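A sketch of the conversion, assuming dateTimeUtc is a Joda-Time DateTime (the row class name here is made up):

import java.sql.Timestamp

// Catalyst doesn't know org.joda.time.DateTime, so map to java.sql.Timestamp
// before registering the table.
case class WindowsEventRow(
    targetEntity: String,
    targetEntityType: String,
    dateTimeUtc: Timestamp)

def toRow(e: WindowsEvent): WindowsEventRow =
  WindowsEventRow(e.targetEntity, e.targetEntityType,
    new Timestamp(e.dateTimeUtc.getMillis))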
On Tue, Oct 27, 2015 at 4:33 PM, Bryan Jeffrey
wrote:
> Hello.
>
> I am working to create a persistent table using SparkSQL HiveContext. I
> have a basic Windows event case class:
>
> case class WindowsEvent(
>
Hello.
I am working to create a persistent table using SparkSQL HiveContext. I
have a basic Windows event case class:
case class WindowsEvent(
    targetEntity: String,
    targetEntityType: String,
    dateTimeUtc: DateTime,
Well, I do not really need to do it while another job is editing them.
I just need to get the names of the folders when I read through
textFile("path/to/dir/*/*/*.js")
Using native Hadoop libraries, can I do something like
fs.copy("/my/path/*/*", "new/path/")?
Narek Galstyan
Նարեկ Գալստյան
One more thing we can try: before committing the offset, we can verify the
latest offset of that partition (in ZooKeeper) against fromOffset in the
OffsetRange.
Just a thought...
Let me know if it works.
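A sketch of that check (readStoredOffset and commitOffset are hypothetical helpers backed by ZooKeeper):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { range =>
    val stored = readStoredOffset(range.topic, range.partition)
    if (stored == range.fromOffset) {
      // no gap: this batch starts exactly where the last commit ended
      commitOffset(range.topic, range.partition, range.untilOffset)
    } // else another run already moved the offset; skip the commit and alert
  }
}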
On Tue, Oct 27, 2015 at 9:00 PM, Cody Koeninger wrote:
> If you want to make
Hello All,
When I checked my running Streaming job on the Web UI, I can see that some
RDDs are listed that were not requested to be cached. What's more, they are
growing! I've not asked for them to be cached. What are they? Are they the
state (updateStateByKey)?
Only the rows in white are being
This won't work, as you can never guarantee which files were read by Spark
if some other process is writing files to the same location. It would be
far less work to move files matching your pattern to a staging location and
then load them using sc.textFile. You should find HDFS FileSystem calls
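For example (a sketch with the Hadoop FileSystem API; the staging path is made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Move matching files into a staging dir, then read only from there, so no
// concurrent writer can slip files into the batch after the move.
val fs = FileSystem.get(new Configuration())
val staging = new Path("/staging/batch-001")
fs.mkdirs(staging)
fs.globStatus(new Path("path/to/dir/*/*/*.js")).foreach { status =>
  fs.rename(status.getPath, new Path(staging, status.getPath.getName))
}
val data = sc.textFile(staging.toString + "/*.js")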
I used the following command:
make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Phive
-Phive-thriftserver -Pyarn
spark-1.6.0-SNAPSHOT-bin-custom-spark.tgz was generated (with patch from
SPARK-11348)
Can you try the above command?
Thanks
On Tue, Oct 27, 2015 at 7:03 AM, Kayode Odeyemi
Hi,
I've a partitioned table in Hive (Avro) that I can query fine from the Hive
CLI.
When using SparkSQL, I'm able to query some of the partitions, but I get an
exception on some of the others.
The query is:
sqlContext.sql("select * from myTable where source='http' and date =
If you want to make sure that your offsets are increasing without gaps...
one way to do that is to enforce that invariant when you're saving to your
database. That would probably mean using a real database instead of
ZooKeeper, though.
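A sketch of enforcing that invariant with plain JDBC (table, column, and connection details are made up; topic, partition, fromOffset, and untilOffset come from the batch's OffsetRange):

import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://db-host/mydb")
conn.setAutoCommit(false)
try {
  // ... write the batch results on the same connection/transaction ...
  val st = conn.prepareStatement(
    "UPDATE offsets SET off = ? WHERE topic = ? AND part = ? AND off = ?")
  st.setLong(1, untilOffset)
  st.setString(2, topic)
  st.setInt(3, partition)
  st.setLong(4, fromOffset)  // only succeeds if offsets are contiguous
  if (st.executeUpdate() == 1) conn.commit() else conn.rollback()
} finally conn.close()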
On Tue, Oct 27, 2015 at 4:13 AM, Krot Viacheslav
On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam wrote:
> Has anyone experienced issues setting Hadoop configurations after the
> SparkContext is initialized? I'm using Spark 1.5.1.
>
> I'm trying to use s3a, which requires access and secret keys set in the
> Hadoop configuration. I tried
If setting the values in SparkConf works, there's probably some bug in
the SQL code; e.g. creating a new Configuration object instead of
using the one in SparkContext. But I'm not really familiar with that
code.
On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam wrote:
> Hi
Hello,
I have developed a Hadoop-based solution that processes a binary file. This
uses the classic Hadoop MR technique. The binary file is about 10GB and divided
into 73 HDFS blocks, and the business logic written as a map process operates
on each of these 73 blocks. We have developed a
Hi Spark users and developers,
Has anyone experienced issues setting Hadoop configurations after the
SparkContext is initialized? I'm using Spark 1.5.1.
I'm trying to use s3a, which requires access and secret keys set in the
Hadoop configuration. I tried to set the properties in the Hadoop configuration
Mind sharing the error you are getting?
On 28 Oct 2015 03:53, "Balachandar R.A." wrote:
> Hello,
>
>
> I have developed a Hadoop-based solution that processes a binary file. This
> uses the classic Hadoop MR technique. The binary file is about 10GB and divided
> into 73 HDFS
Thanks gents.
Removal of 'clean package -U' made the difference.
On Tue, Oct 27, 2015 at 6:39 PM, Todd Nist wrote:
> I issued the same basic command and it worked fine.
>
> RADTech-MBP:spark $ ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn
> -Phadoop-2.6
So I'm going to try and do these again, with an on-line (
http://doodle.com/poll/cr9vekenwims4sna ) and SF version (
http://doodle.com/poll/ynhputd974d9cv5y ). You can help me pick a day that
works for you by filling out the doodle (if none of the days fit let me
know and I can try and arrange
Actually a great idea; I didn't even think about that. Thanks a lot!
On Tue, 27 Oct 2015 at 17:29, varun sharma wrote:
> One more thing we can try is before committing offset we can verify the
> latest offset of that partition(in zookeeper) with fromOffset in
> OffsetRange.
Hi Marcelo,
Thanks for the advice. I understand that we could set the configurations
before creating the SparkContext. My question is that
SparkContext.hadoopConfiguration.set("key", "value") doesn't seem to
propagate to all subsequent SQLContext jobs. Note that I mentioned I can
load the parquet file but
No, the job actually doesn't fail, but since our tests generate all
these stacktraces I have disabled Tungsten mode just to be sure (and to not
have a gazillion stacktraces in production).
2015-10-27 20:59 GMT+01:00 Josh Rosen :
> Hi Sjoerd,
>
> Did your job
Hi Marcelo,
I tried setting the properties before instantiating the SparkContext via
SparkConf. It works fine.
Originally, the code I have reads Hadoop configurations from hdfs-site.xml,
which works perfectly fine as well.
Therefore, can I conclude that sparkContext.hadoopConfiguration.set("key",
I have disabled it because it started generating ERRORs when upgrading
from Spark 1.4 to 1.5.1:
2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to
generate ordering, fallback to interpreted
java.util.concurrent.ExecutionException: java.lang.Exception: failed to
compile:
Hello all,
What I wanted to do is configure the Spark Streaming job to read the
database using JdbcRDD and cache the results. This should occur only once
at the start of the job. It should not make any further connections to the
DB afterwards. Is it possible to do that?
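For reference, a minimal sketch of loading once and caching (the connection URL, query, and bounds are made up):

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Load once at startup and cache. Note: if a cached partition is ever
// evicted, Spark recomputes it, which reopens a connection.
val rows = new JdbcRDD(sc,
  () => DriverManager.getConnection("jdbc:mysql://db-host/mydb"),
  "SELECT id, name FROM lookup WHERE id >= ? AND id <= ?",
  1, 1000, 2,  // lower bound, upper bound, number of partitions
  (rs: ResultSet) => (rs.getLong(1), rs.getString(2))
).cache()
rows.count()  // materialize the cache eagerly at the start of the job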
Hi, I have a long-running Spark job which processes Hadoop ORC files and
creates Hive partitions. Even though I have created an ExecutorService
thread pool and use a pool of 15 threads, I see the active job count as
always 1, which makes the job slow. How do I increase the active job count
in the UI? I remember earlier it
Hi Sjoerd,
Did your job actually *fail* or did it just generate many spurious
exceptions? While the stacktrace that you posted does indicate a bug, I
don't think that it should have stopped query execution because Spark
should have fallen back to an interpreted code path (note the "Failed to
I know it uses a lazy model, which is why I was wondering.
On 27 October 2015 at 19:02, Uthayan Suthakar
wrote:
> Hello all,
>
> What I wanted to do is configure the spark streaming job to read the
> database using JdbcRDD and cache the results. This should occur only
Hi,
We currently use reduceByKey to reduce by a particular metric name in our
Streaming/Batch job. It seems to be doing a lot of shuffles and it has an
impact on performance. Does using a custom partitioner before calling
reduceByKey improve performance?
Thanks,
Swetha
Hi,
I use Spark for its newAPIHadoopRDD method and map/reduce etc. tasks.
When I include it, I see that it has many transitive dependencies.
Which of them should I exclude? I've included the dependency tree of
spark-core. Is there any documentation that explains why they are needed
(maybe all
If you just want to control the number of reducers, then setting
numPartitions is sufficient. If you want to control the exact partitioning
scheme (that is, some scheme other than hash-based), then you need to
implement a custom partitioner. It can be used to improve data skew, etc.,
which
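The message is cut off above; for illustration, a sketch of a custom partitioner that isolates one known hot key (the key and partition count are made up):

import org.apache.spark.Partitioner

class SkewAwarePartitioner(partitions: Int) extends Partitioner {
  override def numPartitions: Int = partitions
  override def getPartition(key: Any): Int = key match {
    case "hot-metric" => 0  // dedicated partition for the skewed key
    case k => 1 + (k.hashCode & Integer.MAX_VALUE) % (partitions - 1)
  }
}

// Usage: metrics.reduceByKey(new SkewAwarePartitioner(16), _ + _)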
Jinghong:
In one of the earlier threads on storing data in HBase, it was found that
the htrace jar was not on the classpath, leading to write failures.
Can you check whether you are facing the same problem?
Cheers
On Tue, Oct 27, 2015 at 5:11 AM, Ted Yu wrote:
> Jinghong:
> Hadmin
So, wouldn't using a custom partitioner on the RDD upon which the
groupByKey or reduceByKey is performed avoid shuffles and improve
performance? My code does groupByAndSort and reduceByKey on different
datasets as shown below. Would using a custom partitioner on those datasets
before using a
We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we
enable checkpointing in Spark, where in the Kinesis stream should a
restarted driver continue? I run a simple experiment as follows:
1. In the first driver run, Spark driver processes 1 million records
starting from
Your observation is correct! The current implementation of checkpointing to
DynamoDB is tied to the presence of new data from Kinesis (I think that
emulates the KCL behavior); if there is no data for a while, checkpointing
does not occur. That explains your observation.
I have filed a JIRA to
Hi DB Tsai,
Thank you again for your insightful comments!
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing
Hello,
I had a question about error handling in a Spark job: if an exception occurs
during the job, what is the best way to get notified of the failure?
Can Spark jobs return different exit codes?
For example, I wrote a dummy Spark job that just throws an Exception, as
follows:
import
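The listing is cut off above; a sketch of one common approach: catch the failure in the driver and exit non-zero so the caller can detect it (assumes client deploy mode, where the driver's exit code is visible; names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object DummyJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DummyJob"))
    val exitCode =
      try {
        sc.parallelize(1 to 10).foreach(_ => throw new Exception("boom"))
        0
      } catch {
        // executor-side exceptions surface here wrapped in a SparkException
        case e: Exception => e.printStackTrace(); 1
      } finally {
        sc.stop()
      }
    sys.exit(exitCode)
  }
}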
Hi Ted,
thanks for your help.
I checked the jar; it is in the classpath, and now the problem is:
1. The following code runs fine, and it puts the result into HBase:
val res = lines.map(pairFunction)
  .groupByKey()
  .flatMap(pairFlatMapFunction)
  .aggregateByKey(new TrainFeature())(seqOp,
I would like to store my data in Berkeley DB on Hadoop and run Spark for
data processing.
Is that possible?
Thanks
Mina
Hi Anand, can you paste the table creation statement? I’d like to reproduce
it locally first. And BTW, which version are you using?
Hao
From: Anand Nalya [mailto:anand.na...@gmail.com]
Sent: Tuesday, October 27, 2015 11:35 PM
To: spark users
Subject: SparkSQL on hive error
Hi,
I've a
After a first glance, this seems to be a bug in Spark SQL; do you mind
creating a JIRA for it? Then I can start to fix it.
Thanks,
Hao
From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, October 28, 2015 3:13 AM
To: Marcelo Vanzin
Cc: user@spark.apache.org
Subject: Re: [Spark-SQL]:
If you specify the same partitioner (custom or otherwise) for both
partitionBy and groupBy, then maybe it will help. The fundamental problem
is groupByKey, which takes a lot of working memory.
1. Try to avoid groupByKey. What is it that you want to do after sorting the
list of grouped events? Can you
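The message is cut off above; for illustration, if all you need after sorting is, say, the latest event per session, a reduceByKey shrinks the data map-side instead of shuffling whole groups (events is a hypothetical RDD[(String, (Long, String))] of (sessionId, (timestamp, json))):

// Keep only the event with the max timestamp per session.
val latest = events.reduceByKey((a, b) => if (a._1 >= b._1) a else b)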
For #2, have you checked the task log(s) to see if there was some clue?
You may want to use foreachPartition to reduce the number of flushes.
In the future, please remove color coding - it is not easy to read.
Cheers
On Tue, Oct 27, 2015 at 6:53 PM, jinhong lu wrote:
> Hi,
If it is streaming, you can look at updateStateByKey for maintaining active
sessions. But it won't work for batch.
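A minimal sketch of that approach (the state type and names are illustrative; eventStream is a hypothetical DStream[(String, (Long, String))]):

// Merge each batch's events into the per-session state, kept sorted.
def updateSession(newEvents: Seq[(Long, String)],
                  state: Option[List[(Long, String)]]): Option[List[(Long, String)]] = {
  val merged = state.getOrElse(Nil) ++ newEvents
  if (merged.isEmpty) None else Some(merged.sortBy(_._1))  // None drops the key
}

val sessions = eventStream.updateStateByKey(updateSession _)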
And I answered that before: it can improve performance if you change the
partitioning scheme from hash-based to something else. It's hard to say
anything beyond that without understanding
I wrote a demo, but there is still no response, no error, no log.
My HBase is 0.98, Hadoop 2.3, Spark 1.4.
And I run in yarn-client mode. Any idea? Thanks.
package com.lujinhong.sparkdemo
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.hadoop.conf.Configuration
hi all,
when using the command:
spark-submit --deploy-mode cluster --jars hdfs:///user/spark/cypher.jar
--class com.suning.spark.jdbc.MysqlJdbcTest hdfs:///user/spark/MysqlJdbcTest.jar
the program throws an exception that it cannot find the class in cypher.jar;
the driver log shows no --jars
Did you try sc.binaryFiles(), which gives you an RDD of
PortableDataStream that wraps the underlying bytes?
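A sketch of how that could plug into the existing per-block logic (the path and myMapLogic are hypothetical):

// Each element is (file path, stream); toArray() pulls the file's bytes.
val blocks = sc.binaryFiles("hdfs:///data/input.bin")
val processed = blocks.map { case (path, stream) =>
  val bytes: Array[Byte] = stream.toArray()
  (path, myMapLogic(bytes))  // reuse the existing per-block map function
}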
On Tue, Oct 27, 2015 at 10:23 PM, Balachandar R.A. wrote:
> Hello,
>
>
> I have developed a Hadoop-based solution that processes a binary file. This
>
Hi,
The Spark assembly jar already includes the Spark core libraries plus their
transitive dependencies, so you don't need to include them in your jar. I
found it easier to use inclusions instead of exclusions when creating an
assembly jar for my Spark job, so I would recommend going with that.
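For example, with sbt-assembly you might mark Spark as provided so it stays out of the fat jar (a sketch; versions are illustrative):

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",
  "com.typesafe"      % "config"     % "1.2.1"  // an actual app dependency
)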
Also, please move the HBase-related code out of the Scala object; this will
resolve the serialization issue and avoid opening connections repeatedly.
And remember to close the table after the final flush.
> On Oct 28, 2015, at 10:13 AM, Ted Yu wrote:
>
> For #2, have you checked task
After sorting the list of grouped events, I would need an RDD that has a key
which is nothing but the sessionId and a list of values that are sorted by
timestamp for each input JSON. So basically the return type would be
RDD[(String, List[(Long, String)])] where the key is the sessionId and
When you build the docs, you are not supposed to be running Spark, but it
looks like you are. These commands are just executed in the shell.
On Tue, Oct 27, 2015 at 2:09 PM, Alex Luya wrote:
> followed this
>
> https://github.com/apache/spark/blob/master/docs/README.md
>
> to build
Dear Spark users,
I am reading a set of JSON files to compile them into the Parquet data format.
I am willing to mark the folders in some way after having read their
contents so that I do not read them again (e.g. I can change the name of the
folder).
I use the .textFile("path/to/dir/*/*/*.js") technique
Seems the build and directory structure in dist is similar to the .gz file
downloaded from the
downloads page. Can the dist directory be used as is?
On Tue, Oct 27, 2015 at 4:03 PM, Kayode Odeyemi wrote:
> Ted, I switched to this:
>
> ./make-distribution.sh --name
Ted, I switched to this:
./make-distribution.sh --name spark-latest --tgz -Dhadoop.version=2.6.0
-Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests clean package -U
Same error. No .gz file. Here's the bottom of the output log:
+ rm -rf /home/emperor/javaprojects/spark/dist
+ mkdir -p
I followed this
https://github.com/apache/spark/blob/master/docs/README.md
to build the Spark docs, but it hangs on:
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See