Re: Maven build failed (Spark master)

2015-10-27 Thread Kayode Odeyemi
Thank you. But I'm getting the same warnings and it's still preventing the archive from being generated. I've run this on both OSX Lion and Ubuntu 12. Same error. No .gz file. On Mon, Oct 26, 2015 at 9:10 PM, Ted Yu wrote: > Looks like '-Pyarn' was missing in your command. > >

Re: Dynamic Resource Allocation with Spark Streaming (Standalone Cluster, Spark 1.5.1)

2015-10-27 Thread Deenar Toraskar
Until Spark Streaming supports dynamic allocation, you could use a StreamingListener to monitor batch execution times and, based on them, call sparkContext.requestExecutors() and sparkContext.killExecutors() to add and remove executors explicitly. On 26 October 2015 at 21:37, Ted Yu
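A minimal sketch of this approach, assuming Spark 1.5's developer-API requestExecutors()/killExecutors() on a cluster manager that supports them; the listener name and thresholds are made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Hypothetical policy: scale up when batches take longer than the batch interval.
    class ScalingListener(sc: SparkContext, batchIntervalMs: Long) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        batch.batchInfo.processingDelay.foreach { procTimeMs =>
          if (procTimeMs > batchIntervalMs) {
            sc.requestExecutors(1)  // ask the cluster manager for one more executor
          } else if (procTimeMs < batchIntervalMs / 2) {
            // sc.killExecutors(...) takes concrete executor IDs; pick the victim
            // yourself, e.g. by tracking executor IDs with a SparkListener.
          }
        }
      }
    }

    // ssc.addStreamingListener(new ScalingListener(ssc.sparkContext, batchIntervalMs = 10000L))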

Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua, For categorical features, the ordinal issue can be solved by trying all kinds of different partitions, 2^(q-1) - 1 of them for q values split into two groups. However, it's computationally expensive. In Hastie's book, in 9.2.4, the trees can be trained by sorting the residuals and being learnt as if they

Re: Broadcast table

2015-10-27 Thread Deenar Toraskar
1) If you are using the thrift server, any cached tables would be cached for all sessions (I am not sure if this was your question). 2) If you want to ensure that the smaller table in the join is replicated to all nodes, you can do the following: left.join(broadcast(right), "joinKey") look at this
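A short sketch of the second point, assuming left and right are DataFrames and Spark 1.5's broadcast function:

    import org.apache.spark.sql.functions.broadcast

    // Hint Spark SQL to replicate the smaller DataFrame to every executor
    // instead of shuffling both sides of the join.
    val joined = left.join(broadcast(right), "joinKey")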

spark to hbase

2015-10-27 Thread jinhong lu
Hi, I write my result to HDFS, and it works well: val model = lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new TrainFeature())(seqOp, combOp).values model.map(a => (a.toKey() + "\t" + a.totalCount + "\t" + a.positiveCount)).saveAsTextFile(modelDataPath); But

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Adrian Tanase
Also I just remembered about cloudera’s contribution http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/ From: Deng Ching-Mallete Date: Tuesday, October 27, 2015 at 12:03 PM To: avivb Cc: user Subject: Re: There is any way to write from spark to HBase

Re: correct and fast way to stop streaming application

2015-10-27 Thread Krot Viacheslav
Any ideas? This is so important because we use kafka direct streaming and save processed offsets manually as the last step in the job, so we achieve at-least-once. But see what happens when a new batch is scheduled after a job fails: - suppose we start from offset 10 loaded from zookeeper - job starts

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Adrian Tanase
This is probably too low level but you could consider the async client inside foreachRdd: https://github.com/OpenTSDB/asynchbase http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd On 10/27/15, 11:36 AM, "avivb" wrote:

Re: spark to hbase

2015-10-27 Thread Deng Ching-Mallete
Hi, It would be more efficient if you configure the table and flush the commits by partition instead of per element in the RDD. The latter works fine because you only have 4 elements, but it won't bode well for large data sets IMO. Thanks, Deng On Tue, Oct 27, 2015 at 5:22 PM, jinhong lu
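A minimal sketch of the per-partition pattern, assuming the HBase 0.98 client API mentioned later in the thread; the table and column-family names are placeholders:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.{HTable, Put}
    import org.apache.hadoop.hbase.util.Bytes

    // model is assumed to be the RDD of TrainFeature records from the question.
    model.foreachPartition { partition =>
      val conf = HBaseConfiguration.create()   // created on the executor, once per partition
      val table = new HTable(conf, "feature_table")
      table.setAutoFlush(false, false)          // buffer puts client-side
      partition.foreach { a =>
        val put = new Put(Bytes.toBytes(a.toKey()))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(a.totalCount))
        table.put(put)
      }
      table.flushCommits()                      // one flush per partition, not per row
      table.close()
    }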

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Deng Ching-Mallete
It's still in HBase's trunk, scheduled for the 2.0.0 release based on the Jira ticket. -Deng On Tue, Oct 27, 2015 at 6:35 PM, Fengdong Yu wrote: > Was this released with Spark 1.*, or is it still kept in the trunk? > > > > > On Oct 27, 2015, at 6:22 PM, Adrian Tanase

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Adrian Tanase
You can get a feel for it by playing with the original library published as separate project on github https://github.com/cloudera-labs/SparkOnHBase From: Deng Ching-Mallete Date: Tuesday, October 27, 2015 at 12:39 PM To: Fengdong Yu Cc: Adrian Tanase, avivb, user Subject: Re: There is any way

Re: spark to hbase

2015-10-27 Thread fightf...@163.com
Hi, I notice that you configured the following: configuration.set("hbase.master", "192.168.1:6"); Did you mistype the host IP? Best, Sun. fightf...@163.com From: jinhong lu Sent: 2015-10-27 17:22 To: spark users Subject: spark to hbase Hi, I write my result to HDFS, and it works well: val

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Fengdong Yu
Was this released with Spark 1.*, or is it still kept in the trunk? > On Oct 27, 2015, at 6:22 PM, Adrian Tanase wrote: > > Also I just remembered about cloudera’s contribution > http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/ >

Separate all values from Iterable

2015-10-27 Thread Shams ul Haque
Hi, I have grouped all my customers in a JavaPairRDD by their customerId (of Long type). This means every customerId has a List of ProductBean. Now I want to save all ProductBeans to the DB irrespective of customerId. I got all values by using the method JavaRDD values =

There is any way to write from spark to HBase CDH4?

2015-10-27 Thread avivb
I have already tried it with https://github.com/unicredit/hbase-rdd and https://github.com/nerdammer/spark-hbase-connector and in both cases I get a timeout. So I would like to know about other options to write from Spark to HBase CDH4. Thanks!

is it proper to make RDD as function parameter in the codes

2015-10-27 Thread Zhiliang Zhu
Dear All, I will program a small project using Spark, and run speed is a big concern. I have a question: since RDDs are always big on the cluster, is it proper to pass an RDD variable as a parameter during a function call? Thank you, Zhiliang

Re: Spark UI consuming lots of memory

2015-10-27 Thread Patrick McGloin
Hi Nicholas, I think you are right about the issue relating to Spark-11126, I'm seeing it as well. Did you find any workaround? Looking at the pull request for the fix it doesn't look possible. Best regards, Patrick On 15 October 2015 at 19:40, Nicholas Pritchard <

SPARKONHBase checkpointing issue

2015-10-27 Thread Amit Singh Hora
Hi all, I am using Cloudera's SparkOnHBase to bulk insert into HBase. Please find the code below: object test { def main(args: Array[String]): Unit = { val conf = ConfigFactory.load("connection.conf").getConfig("connection") val

Re: spark to hbase

2015-10-27 Thread Ted Yu
Jinhong: The hadmin variable is not used. You can omit that line. Which HBase release are you using? As Deng said, don't flush per row. Cheers > On Oct 27, 2015, at 3:21 AM, Deng Ching-Mallete wrote: > > Hi, > > It would be more efficient if you configure the table and

Re: SPARKONHBase checkpointing issue

2015-10-27 Thread Adrian Tanase
Does this help? https://issues.apache.org/jira/browse/SPARK-5206 On 10/27/15, 1:53 PM, "Amit Singh Hora" wrote: >Hi all , > >I am using Cloudera's SparkObHbase to bulk insert in hbase ,Please find >below code >object test { > >def main(args: Array[String]): Unit =

Re: Maven build failed (Spark master)

2015-10-27 Thread Ted Yu
Can you try the same command shown in the pull request ? Thanks > On Oct 27, 2015, at 12:40 AM, Kayode Odeyemi wrote: > > Thank you. > > But I'm getting same warnings and it's still preventing the archive from > being generated. > > I've ran this on both OSX Lion and

Re: Separate all values from Iterable

2015-10-27 Thread Adrian Tanase
The operator you’re looking for is .flatMap. It flattens all the results if you have nested lists of results (e.g. a map over a source element can return zero or more target elements). I’m not very familiar with the Java APIs, but in Scala it would go like this (keeping type annotations only as
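Since the original snippet is truncated, here is a hedged Scala sketch using the grouped-by-customer RDD and the ProductBean type from the question:

    import org.apache.spark.rdd.RDD

    // grouped: RDD[(Long, Iterable[ProductBean])] keyed by customerId.
    // Flatten every customer's products into one RDD, dropping the key.
    val allProducts: RDD[ProductBean] =
      grouped.flatMap { case (_, products) => products }

    // Equivalent form:
    val allProducts2 = grouped.values.flatMap(identity)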

Re: Maven build failed (Spark master)

2015-10-27 Thread Todd Nist
I issued the same basic command and it worked fine. RADTech-MBP:spark $ ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests Which created: spark-1.6.0-SNAPSHOT-bin-hadoop-2.6.tgz in the root directory of the project.

Re: Spark SQL Persistent Table - joda DateTime Compatability

2015-10-27 Thread Michael Armbrust
You'll need to convert it to a java.sql.Timestamp. On Tue, Oct 27, 2015 at 4:33 PM, Bryan Jeffrey wrote: > Hello. > > I am working to create a persistent table using SparkSQL HiveContext. I > have a basic Windows event case class: > > case class WindowsEvent( >
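A one-line sketch of the conversion, assuming a Joda DateTime field as in the WindowsEvent case class:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    // Joda DateTime is not a Spark SQL type; convert it (e.g. in a map before
    // creating the DataFrame, or by changing the case class field type).
    def toTimestamp(dt: DateTime): Timestamp = new Timestamp(dt.getMillis)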

Spark SQL Persistent Table - joda DateTime Compatability

2015-10-27 Thread Bryan Jeffrey
Hello. I am working to create a persistent table using SparkSQL HiveContext. I have a basic Windows event case class: case class WindowsEvent( targetEntity: String, targetEntityType: String, dateTimeUtc: DateTime,

Re: get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")

2015-10-27 Thread Նարեկ Գալստեան
Well, I do not really need to do it while another job is editing them. I just need to get the names of the folders when I read through textFile("path/to/dir/*/*/*.js"). Using *native hadoop* libraries, can I do something like *fs.copy("/my/path/*/*", "new/path/")*? Narek Galstyan Նարեկ Գալստյան

Re: correct and fast way to stop streaming application

2015-10-27 Thread varun sharma
One more thing we can try: before committing the offset, we can verify the latest offset of that partition (in zookeeper) against fromOffset in the OffsetRange. Just a thought... Let me know if it works. On Tue, Oct 27, 2015 at 9:00 PM, Cody Koeninger wrote: > If you want to make

[Spark Streaming] Why are some uncached RDDs are growing?

2015-10-27 Thread diplomatic Guru
Hello All, When I checked my running Stream job on the WebUI, I can see that some RDDs are being listed that were not requested to be cached. What's more, they are growing! I've not asked them to be cached. What are they? Are they the state (UpdateStateByKey)? Only the rows in white are being

Re: get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")

2015-10-27 Thread Deenar Toraskar
This won't work, as you can never guarantee which files were read by Spark if some other process is writing files to the same location. It would be far less work to move files matching your pattern to a staging location and then load them using sc.textFile. You should find HDFS file system calls
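A sketch of the move-to-staging idea using the Hadoop FileSystem API; the paths are placeholders and the rename assumes no file-name collisions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    val fs = FileSystem.get(new Configuration())
    val staging = new Path("/data/staging")
    fs.mkdirs(staging)
    // Move everything matching the glob into the staging directory ...
    val matches = Option(fs.globStatus(new Path("/path/to/dir/*/*/*.js")))
      .getOrElse(Array.empty[FileStatus])
    matches.foreach(s => fs.rename(s.getPath, new Path(staging, s.getPath.getName)))
    // ... then read only what was staged.
    val rdd = sc.textFile("/data/staging/*.js")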

Re: Maven build failed (Spark master)

2015-10-27 Thread Ted Yu
I used the following command: make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn spark-1.6.0-SNAPSHOT-bin-custom-spark.tgz was generated (with patch from SPARK-11348) Can you try above command ? Thanks On Tue, Oct 27, 2015 at 7:03 AM, Kayode Odeyemi

SparkSQL on hive error

2015-10-27 Thread Anand Nalya
Hi, I've a partitioned table in Hive (Avro) that I can query alright from hive cli. When using SparkSQL, I'm able to query some of the partitions, but getting exception on some of the partitions. The query is: sqlContext.sql("select * from myTable where source='http' and date =

Re: correct and fast way to stop streaming application

2015-10-27 Thread Cody Koeninger
If you want to make sure that your offsets are increasing without gaps... one way to do that is to enforce that invariant when you're saving to your database. That would probably mean using a real database instead of zookeeper though. On Tue, Oct 27, 2015 at 4:13 AM, Krot Viacheslav

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Marcelo Vanzin
On Tue, Oct 27, 2015 at 10:43 AM, Jerry Lam wrote: > Anyone experiences issues in setting hadoop configurations after > SparkContext is initialized? I'm using Spark 1.5.1. > > I'm trying to use s3a which requires access and secret key set into hadoop > configuration. I tried

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Marcelo Vanzin
If setting the values in SparkConf works, there's probably some bug in the SQL code; e.g. creating a new Configuration object instead of using the one in SparkContext. But I'm not really familiar with that code. On Tue, Oct 27, 2015 at 11:22 AM, Jerry Lam wrote: > Hi

Using Hadoop Custom Input format in Spark

2015-10-27 Thread Balachandar R.A.
Hello, I have developed a Hadoop-based solution that processes a binary file. It uses the classic Hadoop MR technique. The binary file is about 10GB and divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a

[Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Hi Spark users and developers, Has anyone experienced issues setting hadoop configurations after the SparkContext is initialized? I'm using Spark 1.5.1. I'm trying to use s3a, which requires the access and secret key to be set in the hadoop configuration. I tried to set the properties in the hadoop configuration

Re: Using Hadoop Custom Input format in Spark

2015-10-27 Thread ayan guha
Mind sharing the error you are getting? On 28 Oct 2015 03:53, "Balachandar R.A." wrote: > Hello, > > > I have developed a hadoop based solution that process a binary file. This > uses classic hadoop MR technique. The binary file is about 10GB and divided > into 73 HDFS

Re: Maven build failed (Spark master)

2015-10-27 Thread Kayode Odeyemi
Thanks gents. Removal of 'clean package -U' made the difference. On Tue, Oct 27, 2015 at 6:39 PM, Todd Nist wrote: > I issued the same basic command and it worked fine. > > RADTech-MBP:spark $ ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn > -Phadoop-2.6

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-27 Thread Holden Karau
So I'm going to try and do these again, with an on-line ( http://doodle.com/poll/cr9vekenwims4sna ) and SF version ( http://doodle.com/poll/ynhputd974d9cv5y ). You can help me pick a day that works for you by filling out the doodle (if none of the days fit let me know and I can try and arrange

Re: correct and fast way to stop streaming application

2015-10-27 Thread Krot Viacheslav
Actually a great idea; I didn't even think about that. Thanks a lot! Tue, 27 Oct 2015 at 17:29, varun sharma: > One more thing we can try is before committing offset we can verify the > latest offset of that partition(in zookeeper) with fromOffset in > OffsetRange.

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Hi Marcelo, Thanks for the advice. I understand that we could set the configurations before creating SparkContext. My question is SparkContext.hadoopConfiguration.set("key","value") doesn't seem to propagate to all subsequent SQLContext jobs. Note that I mentioned I can load the parquet file but

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Sjoerd Mulder
No, the job actually doesn't fail, but since our tests are generating all these stacktraces I have disabled the tungsten mode just to be sure (and don't have a gazillion stacktraces in production). 2015-10-27 20:59 GMT+01:00 Josh Rosen : > Hi Sjoerd, > > Did your job

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
Hi Marcelo, I tried setting the properties before instantiating spark context via SparkConf. It works fine. Originally, the code I have read hadoop configurations from hdfs-site.xml which works perfectly fine as well. Therefore, can I conclude that sparkContext.hadoopConfiguration.set("key",
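For reference, a hedged sketch of the approach that worked (setting the values before the SparkContext exists); it relies on Spark copying any "spark.hadoop.*" entry in SparkConf into the Hadoop Configuration, and the key values are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3a-example")
      // Copied into sc.hadoopConfiguration at startup, so later SQLContext
      // jobs see the same credentials.
      .set("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
      .set("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val sc = new SparkContext(conf)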

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Sjoerd Mulder
I have disabled it because it started generating ERRORs when upgrading from Spark 1.4 to 1.5.1: 2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to generate ordering, fallback to interpreted java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile:
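For context, a sketch of how Tungsten can be switched off, assuming the spark.sql.tungsten.enabled flag available in the 1.5 line:

    // Either per SQLContext ...
    sqlContext.setConf("spark.sql.tungsten.enabled", "false")
    // ... or at submit time:
    // spark-submit --conf spark.sql.tungsten.enabled=false ...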

[Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-27 Thread Uthayan Suthakar
Hello all, What I wanted to do is configure the spark streaming job to read the database using JdbcRDD and cache the results. This should occur only once at the start of the job. It should not make any further connection to DB afterwards. Is it possible to do that?
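One way this could look with JdbcRDD, sketched under the assumption of a JDBC-accessible table with placeholder connection details; the cached RDD is only re-read from the database if Spark has to recompute it (e.g. after executor loss):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    // Built once, before the streaming computation is set up.
    val refData = new JdbcRDD(
      ssc.sparkContext,
      () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass"),
      "SELECT id, value FROM reference WHERE id >= ? AND id <= ?",  // the two ?s bound the partitions
      lowerBound = 1L, upperBound = 1000000L, numPartitions = 4,
      mapRow = rs => (rs.getLong("id"), rs.getString("value"))
    ).cache()
    refData.count()  // materialize the cache before the streaming job starts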

How to increase active job count to make spark job faster?

2015-10-27 Thread unk1102
Hi, I have a long-running Spark job which processes Hadoop ORC files and creates Hive partitions. Even though I have created an ExecutorService thread pool and use a pool of 15 threads, I see the active job count as always 1, which makes the job slow. How do I increase the active job count in the UI? I remember earlier it

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd, Did your job actually *fail* or did it just generate many spurious exceptions? While the stacktrace that you posted does indicate a bug, I don't think that it should have stopped query execution because Spark should have fallen back to an interpreted code path (note the "Failed to

Re: [Spark Streaming] Connect to Database only once at the start of Streaming job

2015-10-27 Thread diplomatic Guru
I know it uses a lazy model, which is why I was wondering. On 27 October 2015 at 19:02, Uthayan Suthakar wrote: > Hello all, > > What I wanted to do is configure the spark streaming job to read the > database using JdbcRDD and cache the results. This should occur only

Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha
Hi, We currently use reduceByKey to reduce by a particular metric name in our Streaming/Batch job. It seems to be doing a lot of shuffles and it has an impact on performance. Does using a custom partitioner before calling reduceByKey improve performance? Thanks, Swetha

Spark Core Transitive Dependencies

2015-10-27 Thread Furkan KAMACI
Hi, I use Spark for its newAPIHadoopRDD method and map/reduce etc. tasks. When I include it I see that it has many transitive dependencies. Which of them should I exclude? I've included the dependency tree of spark-core. Is there any documentation that explains why they are needed (maybe all

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
If you just want to control the number of reducers, then setting numPartitions is sufficient. If you want to control the exact partitioning scheme (that is, some scheme other than hash-based), then you need to implement a custom partitioner. It can be used to improve data skews, etc. which
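A short sketch of both options, assuming pairs is a hypothetical RDD[(String, Long)] and the routing logic is a placeholder:

    import org.apache.spark.{HashPartitioner, Partitioner}

    // Just controlling the number of reducers:
    val reduced = pairs.reduceByKey(new HashPartitioner(64), _ + _)

    // A custom scheme (e.g. to spread out a known hot key):
    class SkewAwarePartitioner(partitions: Int) extends Partitioner {
      override def numPartitions: Int = partitions
      override def getPartition(key: Any): Int =
        math.abs(key.hashCode % numPartitions)  // replace with your own routing logic
    }
    val reduced2 = pairs.reduceByKey(new SkewAwarePartitioner(64), _ + _)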

Re: spark to hbase

2015-10-27 Thread Ted Yu
Jinhong: In one of the earlier threads on storing data to HBase, it was found that the htrace jar was not on the classpath, leading to a write failure. Can you check whether you are facing the same problem? Cheers On Tue, Oct 27, 2015 at 5:11 AM, Ted Yu wrote: > Jinghong: > Hadmin

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
So, wouldn't using a custom partitioner on the RDD upon which the groupByKey or reduceByKey is performed avoid shuffles and improve performance? My code does groupByAndSort and reduceByKey on different datasets as shown below. Would using a custom partitioner on those datasets before using a

expected Kinesis checkpoint behavior when driver restarts

2015-10-27 Thread Hster Geguri
We are using Kinesis with Spark Streaming 1.5 on a YARN cluster. When we enable checkpointing in Spark, where in the Kinesis stream should a restarted driver continue? I run a simple experiment as follows: 1. In the first driver run, Spark driver processes 1 million records starting from

Re: expected Kinesis checkpoint behavior when driver restarts

2015-10-27 Thread Tathagata Das
Your observation is correct! The current implementation of checkpointing to DynamoDB is tied to the presence of new data from Kinesis (I think that emulates the KCL behavior); if there is no data for a while, the checkpointing does not occur. That explains your observation. I have filed a JIRA to

Re: Spark Implementation of XGBoost

2015-10-27 Thread Meihua Wu
Hi DB Tsai, Thank you again for your insightful comments! 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing

How to catch error during Spark job?

2015-10-27 Thread Isabelle Phan
Hello, I had a question about error handling in Spark job: if an exception occurs during the job, what is the best way to get notification of the failure? Can Spark jobs return with different exit codes? For example, I wrote a dummy Spark job just throwing out an Exception, as follows: import
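A minimal sketch of one way to surface failures, assuming a driver run in client mode where the JVM exit code is what spark-submit returns; the job body and names are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("my-job"))
        try {
          sc.parallelize(1 to 10).map(_ * 2).count()  // real work goes here
        } catch {
          case e: Exception =>
            // notify / log the failure here, then exit non-zero
            sc.stop()
            sys.exit(1)
        }
        sc.stop()
      }
    }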

Re: spark to hbase

2015-10-27 Thread jinhong lu
Hi Ted, thanks for your help. I checked the jar; it is in the classpath, and now the problem is: 1. The following code runs well, and it puts the result into HBase: val res = lines.map(pairFunction).groupByKey().flatMap(pairFlatMapFunction).aggregateByKey(new TrainFeature())(seqOp,

Berkeley DB storage for Spark

2015-10-27 Thread Mina
I would like to store my data in a Berkeley DB in Hadoop and run Spark for data processing. Is it possible? Thanks Mina

RE: SparkSQL on hive error

2015-10-27 Thread Cheng, Hao
Hi Anand, can you paste the table creation statement? I’d like to reproduce that locally first, and BTW, which version are you using? Hao From: Anand Nalya [mailto:anand.na...@gmail.com] Sent: Tuesday, October 27, 2015 11:35 PM To: spark users Subject: SparkSQL on hive error Hi, I've a

RE: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Cheng, Hao
At a first glance, this seems to be a bug in Spark SQL; do you mind creating a JIRA for this? Then I can start to fix it. Thanks, Hao From: Jerry Lam [mailto:chiling...@gmail.com] Sent: Wednesday, October 28, 2015 3:13 AM To: Marcelo Vanzin Cc: user@spark.apache.org Subject: Re: [Spark-SQL]:

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
If you specify the same partitioner (custom or otherwise) for both partitionBy and groupBy, then maybe it will help. The fundamental problem is groupByKey, which takes a lot of working memory. 1. Try to avoid groupByKey. What is it that you want to do after sorting the list of grouped events? Can you

Re: spark to hbase

2015-10-27 Thread Ted Yu
For #2, have you checked task log(s) to see if there was some clue ? You may want to use foreachPartition to reduce the number of flushes. In the future, please remove color coding - it is not easy to read. Cheers On Tue, Oct 27, 2015 at 6:53 PM, jinhong lu wrote: > Hi,

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
If it is streaming, you can look at updateStateByKey for maintaining active sessions. But it won't work for batch. And I answered that before: it can improve performance if you change the partitioning scheme from hash-based to something else. It's hard to say anything beyond that without understanding

Re: spark to hbase

2015-10-27 Thread jinhong lu
I wrote a demo, but there is still no response, no error, no log. My HBase is 0.98, Hadoop 2.3, Spark 1.4, and I run in yarn-client mode. Any idea? Thanks. package com.lujinhong.sparkdemo import org.apache.spark._ import org.apache.spark.rdd.NewHadoopRDD import org.apache.hadoop.conf.Configuration;

--jars option using hdfs jars cannot effect when spark standlone deploymode with cluster

2015-10-27 Thread our...@cnsuning.com
Hi all, when using the command: spark-submit --deploy-mode cluster --jars hdfs:///user/spark/cypher.jar --class com.suning.spark.jdbc.MysqlJdbcTest hdfs:///user/spark/MysqlJdbcTest.jar the program throws an exception that it cannot find the class in cypher.jar; the driver log shows no --jars

Re: Using Hadoop Custom Input format in Spark

2015-10-27 Thread Sabarish Sasidharan
Did you try sc.binaryFiles(), which gives you an RDD of PortableDataStream that wraps the underlying bytes? On Tue, Oct 27, 2015 at 10:23 PM, Balachandar R.A. wrote: > Hello, > > > I have developed a hadoop based solution that process a binary file. This >
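A small sketch of what that looks like; the path is a placeholder:

    // Each whole file becomes one record; the PortableDataStream lets the task
    // open the bytes lazily on the executor.
    val files = sc.binaryFiles("hdfs:///data/binary/*")
    val sizes = files.map { case (path, stream) =>
      val bytes = stream.toArray()  // reads the full file into memory
      (path, bytes.length)
    }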

Re: Spark Core Transitive Dependencies

2015-10-27 Thread Deng Ching-Mallete
Hi, The spark assembly jar already includes the spark core libraries plus their transitive dependencies, so you don't need to include them in your jar. I found it easier to use inclusions instead of exclusions when creating an assembly jar of my spark job so I would recommend going with that.
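If you build with sbt-assembly, one hedged sketch of the inclusion-style setup is to mark Spark itself as "provided" so only your own libraries end up in the job jar; the versions below are only examples:

    // build.sbt
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.5.1" % "provided",  // supplied by the cluster at runtime
      // bundle only the libraries Spark does not already ship
      "org.apache.hbase" % "hbase-client" % "0.98.12-hadoop2"
    )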

Re: spark to hbase

2015-10-27 Thread Fengdong Yu
Also, please move the HBase-related code into the Scala object; this will resolve the serialization issue and avoid opening connections repeatedly. And remember to close the table after the final flush. > On Oct 28, 2015, at 10:13 AM, Ted Yu wrote: > > For #2, have you checked task

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
After sorting the list of grouped events I would need to have an RDD that has a key which is nothing but the sessionId and a list of values that are sorted by timeStamp for each input JSON. So basically the return type would be RDD[(String, List[(Long, String)])] where the key is the sessionId and
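For reference, a sketch of that shape (using groupByKey, which the earlier reply warns is memory-hungry); events is a hypothetical RDD of (sessionId, (timestamp, json)) pairs:

    import org.apache.spark.rdd.RDD

    val sessions: RDD[(String, List[(Long, String)])] =
      events.groupByKey()                       // one entry per sessionId
            .mapValues(_.toList.sortBy(_._1))   // sort each session's events by timestamp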

Re: doc building process hangs on Failed to load class “org.slf4j.impl.StaticLoggerBinder”

2015-10-27 Thread Sean Owen
When you build docs, you are not running Spark. It looks like you are though. These commands are just executed in the shell. On Tue, Oct 27, 2015 at 2:09 PM, Alex Luya wrote: > followed this > > https://github.com/apache/spark/blob/master/docs/README.md > > to build

get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")

2015-10-27 Thread Նարեկ Գալստեան
Dear Spark users, I am reading a set of JSON files to compile them to the Parquet data format. I would like to mark the folders in some way after having read their contents so that I do not read them again (e.g. I could change the name of the folder). I use the .textFile("path/to/dir/*/*/*.js") technique

Re: Maven build failed (Spark master)

2015-10-27 Thread Kayode Odeyemi
It seems the build and directory structure in dist is similar to the .gz file downloaded from the downloads page. Can the dist directory be used as-is? On Tue, Oct 27, 2015 at 4:03 PM, Kayode Odeyemi wrote: > Ted, I switched to this: > > ./make-distribution.sh --name

Re: Maven build failed (Spark master)

2015-10-27 Thread Kayode Odeyemi
Ted, I switched to this: ./make-distribution.sh --name spark-latest --tgz -Dhadoop.version=2.6.0 -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -DskipTests clean package -U Same error. No .gz file. Here's the bottom output log: + rm -rf /home/emperor/javaprojects/spark/dist + mkdir -p

doc building process hangs on Failed to load class “org.slf4j.impl.StaticLoggerBinder”

2015-10-27 Thread Alex Luya
I followed this https://github.com/apache/spark/blob/master/docs/README.md to build the Spark docs, but it hangs on: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See