Hi,

I'm currently working on a Spark/HBase setup that processes log files (~10
GB/day). These log files are written hourly on n > 10 application servers
and then copied to a 4-node HDFS cluster.
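
For context, the hourly copy step is essentially an "hdfs dfs -put"; as a
Scala sketch via the Hadoop FileSystem API (namenode address and paths are
placeholders for our actual layout):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Scala equivalent of `hdfs dfs -put`; host and paths are placeholders.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    fs.copyFromLocalFile(
      new Path("file:///var/log/app/logs-2016-08-01-15"),
      new Path("/logs/context1/logs-2016-08-01-15"))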

Our current Spark job aggregates individual visits (keyed by a session UUID)
across all application servers on a daily basis. The visits are then filtered
(only about 1% of the data remains) and stored in HBase for further processing.
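
For reference, the daily job follows roughly this pattern (a simplified
sketch: the log format, the position of the session UUID, and the filter
predicate are placeholders, and the HBase write is omitted):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("daily-visit-aggregation"))

    // Assumed log format: <session-uuid>\t<timestamp>\t<payload>
    val lines = sc.textFile("hdfs:///logs/context*/logs-2016-08-01-*")

    val visits = lines
      .map(line => (line.split("\t")(0), line)) // key each event by its session UUID
      .groupByKey()                             // one group per visit, across all servers

    // Placeholder predicate standing in for the real filter (~1% of data survives).
    val filtered = visits.filter { case (_, events) => events.size > 1 }

    // The result is then written to HBase (write code omitted).
    println(filtered.count())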

We don't use the Spark Streaming API at the moment; a cron job runs once a
day and triggers the visit calculation.

Questions:
1) Is it really necessary to store the log files in HDFS, or can Spark
somehow read the files from a local file system and distribute the data to
the other nodes? Rationale: the data is (probably) only read once, during
the visit calculation, which defeats the purpose of a DFS.
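
As far as I can tell from the docs, sc.textFile does accept file:// URLs,
but the file then has to be accessible at the same path on every worker
node (e.g. via an NFS mount), otherwise tasks scheduled on other nodes
fail. A minimal sketch, with an assumed shared mount point:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("local-fs-read"))

    // Only works if /mnt/shared/logs is visible under the same path on the
    // driver and on every worker, because each task opens the file locally.
    val lines = sc.textFile("file:///mnt/shared/logs/logs-2016-08-01-*")
    println(lines.count())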

2) If the raw log files do have to be in HDFS, I have to remove them from
HDFS after processing, i.e. COPY -> PROCESS -> REMOVE. Is this the way to
go?
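
The REMOVE step would presumably be something like this (a sketch via the
Hadoop FileSystem API; namenode address and path are placeholders):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Delete the processed input directory once the job has finished.
    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    val processed = new Path("/logs/2016-08-01")
    if (fs.exists(processed)) {
      fs.delete(processed, true) // recursive delete
    }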

3) Before I can process the visits for a given hour, I have to wait until
the log files from all application servers have been copied to HDFS. It
doesn't seem like StreamingContext.fileStream can wait for more
sophisticated patterns, e.g. ("context*/logs-2016-08-01-15"). Do you have a
recommendation for solving this problem? One possible approach: after the
files have been copied, create an additional marker file that signals to
Spark that all files are available?
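
On the driver side, the marker-file idea could look roughly like this (a
sketch loosely modeled on Hadoop's _SUCCESS convention; the directory
layout, marker name, and polling interval are all assumptions):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Block until every application server has dropped its marker file
    // for the given hour, then return so the batch job can start.
    def awaitHourComplete(fs: FileSystem, hour: String, serverCount: Int): Unit = {
      val pattern = new Path(s"/logs/context*/logs-$hour._COMPLETE") // hypothetical name
      def markers(): Int =
        Option(fs.globStatus(pattern)).map(_.length).getOrElse(0) // globStatus may return null
      while (markers() < serverCount) {
        Thread.sleep(30000) // poll every 30 seconds
      }
    }

    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    awaitHourComplete(fs, "2016-08-01-15", serverCount = 10)
    // ...now it is safe to process context*/logs-2016-08-01-15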

If you have any questions, please don't hesitate to ask.

Thanks,
David
